Need a specific sort of string modification. Can someone help?

Discussion in 'Python' started by Sia, Jan 5, 2013.

  1. Sia

    Sia Guest

    I have strings such as:

    tA.-2AG.-2AG,-2ag
    or
    ..+3ACG.+5CAACG.+3ACG.+3ACG

    The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:

    tA..,
    and
    ....

    How can I do that?
    Thanks.
    Sia, Jan 5, 2013
    #1
    1. Advertising

  2. On 05/01/2013 10:35, Sia wrote:
    > I have strings such as:
    >
    > tA.-2AG.-2AG,-2ag
    > or
    > .+3ACG.+5CAACG.+3ACG.+3ACG
    >
    > The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
    >
    > tA..,
    > and
    > ...
    >
    > How can I do that?
    > Thanks.
    >


    Here is a naive solution (I am sure there are more elegant ones) -

    def strip(old_string):
    new_string = ''
    max = len(old_string)
    pos = 0
    while pos < max:
    char = old_string[pos]
    if char in ('-', '+'):
    num_pos = pos+1
    num_str = ''
    while old_string[num_pos].isdigit():
    num_str += old_string[num_pos]
    num_pos += 1
    pos = num_pos + int(num_str)
    else:
    new_string += old_string[pos]
    pos += 1
    return new_string

    It caters for the possibility that the number following the +/- could be
    greater than 9 - I don't know if you need that.

    It works with your examples, except that the second one returns '....',
    which I think is correct - there are 4 dots in the original string.

    HTH

    Frank Millman
    Frank Millman, Jan 5, 2013
    #2
    1. Advertising

  3. On Sat, Jan 5, 2013 at 7:35 PM, Sia <> wrote:
    > I have strings such as:
    >
    > tA.-2AG.-2AG,-2ag
    > or
    > .+3ACG.+5CAACG.+3ACG.+3ACG
    >
    > The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
    >
    > tA..,
    > and
    > ...
    >
    > How can I do that?


    Interesting. Are you guaranteed that there are no other plus or minus
    signs? Is the number after the sign just one digit? Assuming the
    answers to both are "yes", here's a one-liner:

    s=".+3ACG.+5CAACG.+3ACG.+3ACG"

    result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

    Split on either - or +, then trim off that many characters from each
    (that's the list comp), then join them back into a string.

    ChrisA
    Chris Angelico, Jan 5, 2013
    #3
  4. Sia

    Roy Smith Guest

    In article <>,
    Sia <> wrote:

    > I have strings such as:
    >
    > tA.-2AG.-2AG,-2ag
    > or
    > .+3ACG.+5CAACG.+3ACG.+3ACG


    Some kind of DNA binding site?

    A couple of questions. Are the numbers always single digits? How much
    data is there? Are we talking a few hundred 20-character strings, or
    all of Genbank?

    > The plus and minus signs are always followed by a number (say, i). I want
    > python to find each single plus or minus, remove the sign, the number after
    > it and remove i characters after that. So the two strings above become:
    >
    > tA..,
    > and
    > ...


    If I follow your description properly, the last output should be "...."
    (4 dots), right? This looks like it should work. I'm sure there's more
    efficient ways to do it, but for small inputs, this should be ok.˜

    The general pattern here is a state machine. It's a good pattern to
    learn if you're going to be doing any kind of sequence analysis. See,
    for example, http://en.wikipedia.org/wiki/State_machine.

    # Build up the new string as a list (for efficiency)
    new = []

    # Keep track of what state we're in. The three possible states
    # are 1) scanning for a region to be deleted, 2) looking for the
    # number, and 3) reading past the letters to be deleted.
    SCANNING = 1
    NUMBER = 2
    DROPPING = 3
    state = SCANNING

    # If we are in state DROPPING, dropcount is the number of
    # letters remaining to be dropped.
    dropcount = 0

    old = '.+3ACG.+5CAACG.+3ACG.+3ACG'
    for c in old:
    if state == SCANNING:
    if c in '+-':
    state = NUMBER
    else:
    new.append(c)

    elif state == NUMBER:
    # Assume the counts are all single digits. If not, then
    # we'll need a 4th state for accumulating the digits.
    dropcount = int(c)
    state = DROPPING

    else:
    assert state == DROPPING
    dropcount -= 1
    if dropcount == 0:
    state = SCANNING

    print ''.join(new)
    Roy Smith, Jan 5, 2013
    #4
  5. Sia

    Roy Smith Guest

    In article <>,
    Chris Angelico <> wrote:

    > result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])


    That's exceedingly clever. But bordering on line noise. At the very
    least, I would break it up into a couple of lines to make it easier to
    understand (plus you can print out the intermediate values to see what's
    going on):

    chunks = ("0"+s).replace("-","+").split("+")
    result = "".join([x[int(x[0])+1:] for x in chunks]
    Roy Smith, Jan 5, 2013
    #5
  6. On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <> wrote:
    > In article <>,
    > Chris Angelico <> wrote:
    >
    >> result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

    >
    > That's exceedingly clever. But bordering on line noise. At the very
    > least, I would break it up into a couple of lines to make it easier to
    > understand (plus you can print out the intermediate values to see what's
    > going on):
    >
    > chunks = ("0"+s).replace("-","+").split("+")
    > result = "".join([x[int(x[0])+1:] for x in chunks]


    Sure. You can always split a one-liner to taste, doesn't much matter
    where. You could split it majorly into half a dozen lines if you want,
    doesn't make a lot of diff. It'll work the same way :)

    ChrisA
    Chris Angelico, Jan 5, 2013
    #6
  7. Sia

    Roy Smith Guest

    In article <>,
    Chris Angelico <> wrote:

    > On Sun, Jan 6, 2013 at 1:30 AM, Roy Smith <> wrote:
    > > In article <>,
    > > Chris Angelico <> wrote:
    > >
    > >> result = "".join([x[int(x[0])+1:] for x in
    > >> ("0"+s).replace("-","+").split("+")])

    > >
    > > That's exceedingly clever. But bordering on line noise. At the very
    > > least, I would break it up into a couple of lines to make it easier to
    > > understand (plus you can print out the intermediate values to see what's
    > > going on):
    > >
    > > chunks = ("0"+s).replace("-","+").split("+")
    > > result = "".join([x[int(x[0])+1:] for x in chunks]

    >
    > Sure. You can always split a one-liner to taste, doesn't much matter
    > where.


    Well, there are better and worse places to split things. Consider these
    three statements:

    result = f1().f2().f3().f4().f5()
    result = f1(f2.f3(f4().f5()))
    result = f1(f2(3(f4(f5()))))

    Same number of function calls, but the middle one is clearly the most
    difficult to understand. The reason is because the scan direction keeps
    changing. The first one is strictly left-to-right. The last one is
    strictly inside-to-outside. The middle one is all over the place.

    That's why I chose to split this where I did. It was where the scan
    direction changed.

    Readability matters.
    Roy Smith, Jan 5, 2013
    #7
  8. On Sun, Jan 6, 2013 at 2:03 AM, Roy Smith <> wrote:
    > That's why I chose to split this where I did. It was where the scan
    > direction changed.


    Ah, good point. In any case, this is a fairly simple and clear way of
    doing things; it may or may not run faster than the explicit state
    machine, but IMHO it's a lot clearer to read a split() than something
    that changes state when a particular character is found.

    ChrisA
    Chris Angelico, Jan 5, 2013
    #8
  9. Sia

    Roy Smith Guest

    In article <>,
    Chris Angelico <> wrote:

    > it may or may not run faster than the explicit state machine,


    Hmmm, hard to say. Both are O(n), but yours makes several more copies
    of the data than mine (the string addition, the replace(), the split(),
    the string slicing). We both make copies as we put fragments into a
    list, and when we do the join().

    On the other hand, yours does all that inside the library, which is
    generally faster than random python code.

    Let's see:

    My way:
    real 0m2.097s
    user 0m2.078s
    sys 0m0.016s

    Your way:
    real 0m0.649s
    user 0m0.622s
    sys 0m0.025s

    You got me by a factor of 3 or 4. Not bad. And not terribly
    surprising. Once you've got things to the same O() group, pushing as
    much data manipulation as you can down into the library is usually a
    pretty effective optimization technique.

    > but IMHO it's a lot clearer to read a split() than something
    > that changes state when a particular character is found.


    Maybe. But being familiar with state machines is still a handy skill.
    DNA sequence analysis has lots of problems like "find a start codon
    which is within about 50 bases of a binding site, and then copy
    everything up until you find a stop codon". Things like that often map
    well to state machines. Especially if you're trying to do it in
    parallel in all three reading frames.
    Roy Smith, Jan 5, 2013
    #9
  10. On Sun, Jan 6, 2013 at 2:38 AM, Roy Smith <> wrote:
    > In article <>,
    > Chris Angelico <> wrote:
    >
    >> it may or may not run faster than the explicit state machine,

    >
    > You got me by a factor of 3 or 4. Not bad.


    You miss my point, though. I went for simple Pythonic code, and never
    measured its performance, on the expectation that it's "good enough".
    Written in C, the state machine is probably WAY faster than splitting
    and then iterating. My C++ MUD client uses code similar to that to
    parse TELNET and ANSI codes from a stream of bytes in a socket (and
    one of its "states" is that there's no more data available, so wait on
    the socket); the rewrite in a high level language divides the string
    on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
    then afterward splits on "\n" to divide into lines. The code's much
    less convoluted, it's easier to test different parts (because I can
    simply call the ANSI parser with a block of text), and on a modern
    computer, you can't see the performance difference (since you spend
    most of your time waiting for socket data anyway).

    But it's gratifying to know that the obvious and brief way to do
    things is fast too :)

    >> but IMHO it's a lot clearer to read a split() than something
    >> that changes state when a particular character is found.

    >
    > Maybe. But being familiar with state machines is still a handy skill.
    > DNA sequence analysis has lots of problems like "find a start codon
    > which is within about 50 bases of a binding site, and then copy
    > everything up until you find a stop codon". Things like that often map
    > well to state machines. Especially if you're trying to do it in
    > parallel in all three reading frames.


    Sure. And if you're working with petabytes of data, these
    considerations become fairly important. When that happens, you start
    rewriting your algorithms in C, or using Cython, or something; at very
    least, you start rewriting clear and simple algorithms into more
    complex ones. But all this happens *after* the code has been tested
    and proven. All the rewrites can be verified as being identical to
    their reference implementations; you can test one piece at a time as
    you change them. It's ever so much easier to work that way.

    <anecdote>At work, we had one employee whose code was, shall we say,
    less than stellar. At one point, he embarked on a months-long rewrite
    of one of his modules; meanwhile, I was unable to adequately test code
    that called on it. Once the rewrite was finally complete, I discovered
    myriad bugs in my own code, ones that would have been found and fixed
    instantly if I'd had even a slow version of the code to work against.
    Starting with something you can easily debug helps enormously with
    that, because debugging doesn't demand mega-TPS throughput.</anecdote>

    ChrisA
    Chris Angelico, Jan 5, 2013
    #10
  11. Sia

    Tim Chase Guest

    On 01/05/13 02:35, Sia wrote:
    > I have strings such as:
    >
    > tA.-2AG.-2AG,-2ag
    > or
    > .+3ACG.+5CAACG.+3ACG.+3ACG
    >
    > The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
    >
    > tA..,
    > and
    > ...


    With the same caveat as Frank posted about the second one being
    "...." (4 dots), I don't know how this version times out:

    import re
    r = re.compile(r"[-+](\d+)([^-+]*)")
    def modify(m):
    result = m.group(2)[int(m.group(1)):]
    return result
    for test, expected in (
    ("tA.-2AG.-2AG,-2ag", "tA..,"),
    (".+3ACG.+5CAACG.+3ACG.+3ACG", "...."),
    ):
    s = r.sub(modify, test)
    print "%r -> %r (%r)" % (
    test, s, expected
    )
    assert s == expected, "Nope"

    (it passes the tests as modified to "....")

    -tkc
    Tim Chase, Jan 5, 2013
    #11
  12. Sia

    Tim Chase Guest

    On 01/05/13 11:24, Tim Chase wrote:
    > I don't know how this version times out:
    >
    > import re
    > r = re.compile(r"[-+](\d+)([^-+]*)")
    > def modify(m):
    > result = m.group(2)[int(m.group(1)):]
    > return result


    Doh, I intended to change this after testing, making it just

    returm m.group(2)[int(m.group(1)):]

    rather than expending an intermediate variable I used for testing

    -tkc
    Tim Chase, Jan 5, 2013
    #12
  13. Sia

    Ian Kelly Guest

    On Sat, Jan 5, 2013 at 8:57 AM, Chris Angelico <> wrote:
    > You miss my point, though. I went for simple Pythonic code, and never
    > measured its performance, on the expectation that it's "good enough".
    > Written in C, the state machine is probably WAY faster than splitting
    > and then iterating. My C++ MUD client uses code similar to that to
    > parse TELNET and ANSI codes from a stream of bytes in a socket (and
    > one of its "states" is that there's no more data available, so wait on
    > the socket); the rewrite in a high level language divides the string
    > on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
    > then afterward splits on "\n" to divide into lines. The code's much
    > less convoluted, it's easier to test different parts (because I can
    > simply call the ANSI parser with a block of text), and on a modern
    > computer, you can't see the performance difference (since you spend
    > most of your time waiting for socket data anyway).


    Anecdotally and somewhat off-topic, when I wrote my own MUD client in
    Python, I implemented both TELNET and ANSI parsing in Python using a
    state machine processing one byte at a time (actually two state
    machines - one at the protocol layer and one at the client layer; the
    telnet module is a modified version of the twisted.conch.telnet
    module). I was worried that the processing would be too slow in
    Python. When I got it running, it turned out that there was a
    noticeable lag between input being received and displayed. But when I
    profiled the issue it actually turned out that the rich text control I
    was using, which was written in C++, wasn't able to handle a large
    buffer gracefully. The parsing code in Python did just fine and
    didn't contribute to the lag issue at all.
    Ian Kelly, Jan 5, 2013
    #13
  14. On Sun, Jan 6, 2013 at 7:04 AM, Ian Kelly <> wrote:
    > On Sat, Jan 5, 2013 at 8:57 AM, Chris Angelico <> wrote:
    >> You miss my point, though. I went for simple Pythonic code, and never
    >> measured its performance, on the expectation that it's "good enough".
    >> Written in C, the state machine is probably WAY faster than splitting
    >> and then iterating. My C++ MUD client uses code similar to that to
    >> parse TELNET and ANSI codes from a stream of bytes in a socket (and
    >> one of its "states" is that there's no more data available, so wait on
    >> the socket); the rewrite in a high level language divides the string
    >> on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
    >> then afterward splits on "\n" to divide into lines. The code's much
    >> less convoluted, it's easier to test different parts (because I can
    >> simply call the ANSI parser with a block of text), and on a modern
    >> computer, you can't see the performance difference (since you spend
    >> most of your time waiting for socket data anyway).

    >
    > Anecdotally and somewhat off-topic, when I wrote my own MUD client in
    > Python, I implemented both TELNET and ANSI parsing in Python using a
    > state machine processing one byte at a time (actually two state
    > machines - one at the protocol layer and one at the client layer; the
    > telnet module is a modified version of the twisted.conch.telnet
    > module). I was worried that the processing would be too slow in
    > Python. When I got it running, it turned out that there was a
    > noticeable lag between input being received and displayed. But when I
    > profiled the issue it actually turned out that the rich text control I
    > was using, which was written in C++, wasn't able to handle a large
    > buffer gracefully. The parsing code in Python did just fine and
    > didn't contribute to the lag issue at all.


    The opposite decision, the same result. Performance was *good enough*.
    With Gypsum, my early performance problems were almost completely
    solved by properly handling the expose-event and only painting the
    parts that changed (I don't use a rich text control, I use a
    GTK2.DrawingArea). Text processing of something that's come over a
    network is seldom a problem; if it were, we'd see noticeably slower
    downloads over HTTPS than HTTP, and that just doesn't happen. I think
    (though I can't prove) that crypto written in C is probably more
    expensive than ANSI parsing written in Python.

    ChrisA
    Chris Angelico, Jan 5, 2013
    #14
  15. Sia

    Roy Smith Guest

    In article <>,
    Chris Angelico <> wrote:

    > On Sun, Jan 6, 2013 at 7:04 AM, Ian Kelly <> wrote:
    > > On Sat, Jan 5, 2013 at 8:57 AM, Chris Angelico <> wrote:
    > >> You miss my point, though. I went for simple Pythonic code, and never
    > >> measured its performance, on the expectation that it's "good enough".
    > >> Written in C, the state machine is probably WAY faster than splitting
    > >> and then iterating. My C++ MUD client uses code similar to that to
    > >> parse TELNET and ANSI codes from a stream of bytes in a socket (and
    > >> one of its "states" is that there's no more data available, so wait on
    > >> the socket); the rewrite in a high level language divides the string
    > >> on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
    > >> then afterward splits on "\n" to divide into lines. The code's much
    > >> less convoluted, it's easier to test different parts (because I can
    > >> simply call the ANSI parser with a block of text), and on a modern
    > >> computer, you can't see the performance difference (since you spend
    > >> most of your time waiting for socket data anyway).

    > >
    > > Anecdotally and somewhat off-topic, when I wrote my own MUD client in
    > > Python, I implemented both TELNET and ANSI parsing in Python using a
    > > state machine processing one byte at a time (actually two state
    > > machines - one at the protocol layer and one at the client layer; the
    > > telnet module is a modified version of the twisted.conch.telnet
    > > module). I was worried that the processing would be too slow in
    > > Python. When I got it running, it turned out that there was a
    > > noticeable lag between input being received and displayed. But when I
    > > profiled the issue it actually turned out that the rich text control I
    > > was using, which was written in C++, wasn't able to handle a large
    > > buffer gracefully. The parsing code in Python did just fine and
    > > didn't contribute to the lag issue at all.

    >
    > The opposite decision, the same result. Performance was *good enough*.
    > With Gypsum, my early performance problems were almost completely
    > solved by properly handling the expose-event and only painting the
    > parts that changed (I don't use a rich text control, I use a
    > GTK2.DrawingArea). Text processing of something that's come over a
    > network is seldom a problem; if it were, we'd see noticeably slower
    > downloads over HTTPS than HTTP, and that just doesn't happen. I think
    > (though I can't prove) that crypto written in C is probably more
    > expensive than ANSI parsing written in Python.
    >
    > ChrisA


    It's rare to find applications these days that are truly CPU bound.
    Once you've used some reasonable algorithm, i.e. not done anything in
    O(n^2) that could have been done in O(n) or O(n log n), you will more
    often run up against I/O speed, database speed, network latency, memory
    exhaustion, or some such as the reason your code is too slow.

    The upshot of this is that for most things, Python (even though it runs
    an order of magnitude slower than C), will always be *good enough*.

    I'm sure I've mentioned this before, but the application code for
    Songza.com is 100% Python. We pretty much can't even measure how much
    CPU time is spent running Python code. Everything is database, network,
    and (occasionally) disk throughput.
    Roy Smith, Jan 5, 2013
    #15
  16. On 01/05/2013 03:35 AM, Sia wrote:
    > I have strings such as:
    >
    > tA.-2AG.-2AG,-2ag
    > or
    > .+3ACG.+5CAACG.+3ACG.+3ACG
    >
    > The plus and minus signs are always followed by a number (say, i). I

    want python to find each single plus or minus, remove the sign, the
    number after it and remove i characters after that. So the two strings
    above become:
    >
    > tA..,
    > and
    > ...
    >
    > How can I do that?
    > Thanks.



    I think it's a bit cleaner and nicer to do something similar to
    itertools.takewhile but takewhile 'eats' a single next value.
    I was actually doing some stuff that also needed this. I wonder if
    there's a more elegant, robust way to do this?

    Here's what I got for now:


    class BIterator(object):
    """Iterator with 'buffered' takewhile."""

    def __init__(self, seq):
    self.seq = iter(seq)
    self.buffer = []
    self.end_marker = object()
    self.last = None

    def consume(self, n):
    for _ in range(n): self.next()

    def next(self):
    val = self.buffer.pop() if self.buffer else next(self.seq,
    self.end_marker)
    self.last = val
    return val

    def takewhile(self, test):
    lst = []
    while True:
    val = self.next()
    if val is self.end_marker:
    return lst
    elif test(val):
    lst.append(val)
    else:
    self.buffer.append(val)
    return lst

    def joined_takewhile(self, test):
    return ''.join(self.takewhile(test))

    def done(self):
    return bool(self.last is self.end_marker)


    s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
    not_plusminus = lambda x: x not in "+-"
    isdigit = lambda x: x.isdigit()

    def process(s):
    lst = []
    s = BIterator(s)

    while True:
    lst.extend(s.takewhile(not_plusminus))
    if s.done(): break
    s.next()
    n = int(s.joined_takewhile(isdigit))
    s.consume(n)

    return ''.join(lst)


    print(process(s))


    Obviously it assumes the input is well-formed, but the logic would be
    very easy to change to, for example, check for s.done() after each step.

    - mitya



    --
    Lark's Tongue Guide to Python: http://lightbird.net/larks/
    Mitya Sirenef, Jan 6, 2013
    #16
  17. Sia

    Roy Smith Guest

    In article <>,
    Roy Smith <> wrote:

    > It's rare to find applications these days that are truly CPU bound.
    > Once you've used some reasonable algorithm, i.e. not done anything in
    > O(n^2) that could have been done in O(n) or O(n log n), you will more
    > often run up against I/O speed, database speed, network latency, memory
    > exhaustion, or some such as the reason your code is too slow.


    Well, I just found a counter-example :)

    I've been doing some log analysis. It's been taking a grovelingly long
    time, so I decided to fire up the profiler and see what's taking so
    long. I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots
    might be (looking up IP addresses in the geolocation database, or
    producing some pretty pictures using matplotlib). It was just a matter
    of figuring out which it was.

    As with most attempts to out-guess the profiler, I was totally,
    absolutely, and embarrassingly wrong.

    It turns out we were spending most of the time parsing timestamps!
    Since there's no convenient way (I don't consider strptime() to be
    convenient) to parse isoformat strings in the standard library, our
    habit has been to use the oh-so-simple parser from the third-party
    dateutil package. Well, it turns out that's slow as all get-out
    (probably because it's trying to be smart about auto-recognizing
    formats). For the test I ran (on a few percent of the real data), we
    spent 90 seconds in parse().

    OK, so I dragged out the strptime() docs and built the stupid format
    string (%Y-%m-%dT%H:%M:%S+00:00). That got us down to 25 seconds in
    strptime().

    But, I could also see it was spending a significant amount in routines
    that looked like they were computing things like day of the week that we
    didn't need. For what I was doing, we only really needed the hour and
    minute. So I tried:

    t_hour = int(date[11:13])
    t_minute = int(date[14:16])

    that got us down to 12 seconds overall (including the geolocation and
    pretty pictures).

    I think it turns out we never do anything with the hour and minute other
    than print them back out, so just

    t_hour_minute = date[11:16]

    would probably be good enough, but I think I'm going to stop where I am
    and declare victory :)
    Roy Smith, Jan 6, 2013
    #17
  18. On 01/06/2013 01:32 AM, Mitya Sirenef wrote:
    > On 01/05/2013 03:35 AM, Sia wrote:
    >> I have strings such as:
    > >
    > > tA.-2AG.-2AG,-2ag
    > > or
    > > .+3ACG.+5CAACG.+3ACG.+3ACG
    > >
    > > The plus and minus signs are always followed by a number (say, i). I

    > want python to find each single plus or minus, remove the sign, the
    > number after it and remove i characters after that. So the two strings
    > above become:
    > >
    > > tA..,
    > > and
    > > ...
    > >
    > > How can I do that?
    > > Thanks.

    >
    >
    > I think it's a bit cleaner and nicer to do something similar to
    > itertools.takewhile but takewhile 'eats' a single next value.
    > I was actually doing some stuff that also needed this. I wonder if
    > there's a more elegant, robust way to do this?
    >
    > Here's what I got for now:
    >
    >
    > class BIterator(object):
    > """Iterator with 'buffered' takewhile."""
    >
    > def __init__(self, seq):
    > self.seq = iter(seq)
    > self.buffer = []
    > self.end_marker = object()
    > self.last = None
    >
    > def consume(self, n):
    > for _ in range(n): self.next()
    >
    > def next(self):
    > val = self.buffer.pop() if self.buffer else next(self.seq,
    > self.end_marker)
    > self.last = val
    > return val
    >
    > def takewhile(self, test):
    > lst = []
    > while True:
    > val = self.next()
    > if val is self.end_marker:
    > return lst
    > elif test(val):
    > lst.append(val)
    > else:
    > self.buffer.append(val)
    > return lst
    >
    > def joined_takewhile(self, test):
    > return ''.join(self.takewhile(test))
    >
    > def done(self):
    > return bool(self.last is self.end_marker)
    >
    >
    > s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
    > not_plusminus = lambda x: x not in "+-"
    > isdigit = lambda x: x.isdigit()
    >
    > def process(s):
    > lst = []
    > s = BIterator(s)
    >
    > while True:
    > lst.extend(s.takewhile(not_plusminus))
    > if s.done(): break
    > s.next()
    > n = int(s.joined_takewhile(isdigit))
    > s.consume(n)
    >
    > return ''.join(lst)
    >
    >
    > print(process(s))
    >
    >
    > Obviously it assumes the input is well-formed, but the logic would be
    > very easy to change to, for example, check for s.done() after each step.
    >
    > - mitya
    >
    >
    >


    I've added some refinements:



    class BIterator(object):
    """Iterator with 'buffered' takewhile and takeuntil."""

    def __init__(self, seq):
    self.seq = iter(seq)
    self.buffer = []
    self.end_marker = object()
    self.last = None

    def __bool__(self):
    return self.last is not self.end_marker

    def __next__(self):
    val = self.buffer.pop() if self.buffer else next(self.seq,
    self.end_marker)
    self.last = val
    return val

    def consume(self, n):
    for _ in range(n): next(self)

    def takewhile(self, test):
    lst = []
    while True:
    val = next(self)
    if val is self.end_marker:
    return lst
    elif test(val):
    lst.append(val)
    else:
    self.buffer.append(val)
    return lst

    def takeuntil(self, test):
    negtest = lambda x: not test(x)
    return self.takewhile(negtest)

    def joined_takewhile(self, test):
    return ''.join(self.takewhile(test))

    def joined_takeuntil(self, test):
    return ''.join(self.takeuntil(test))


    def process(s):
    s = BIterator(s)
    lst = []
    plusminus = lambda x: x in "+-"
    isdigit = lambda x: x.isdigit()

    while s:
    lst.extend(s.takeuntil(plusminus))
    next(s)
    n = s.joined_takewhile(isdigit) or 0
    s.consume(int(n))

    return ''.join(lst)


    s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
    print(process(s))




    --
    Lark's Tongue Guide to Python: http://lightbird.net/larks/
    Mitya Sirenef, Jan 6, 2013
    #18
  19. On Sun, 06 Jan 2013 12:28:55 -0500, Roy Smith wrote:

    > I've been doing some log analysis. It's been taking a grovelingly long
    > time, so I decided to fire up the profiler and see what's taking so
    > long. I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots
    > might be (looking up IP addresses in the geolocation database, or
    > producing some pretty pictures using matplotlib). It was just a matter
    > of figuring out which it was.
    >
    > As with most attempts to out-guess the profiler, I was totally,
    > absolutely, and embarrassingly wrong.


    +1 QOTW

    Thank you for sharing this with us.


    --
    Steven
    Steven D'Aprano, Jan 6, 2013
    #19
  20. Sia

    Nick Mellor Guest

    Hi Sia,

    Here's another variation. It's within my tolerance for readability :) and also quick, if that's an issue. It does 100,000 of your longer string in a couple of seconds on my venerable laptop.

    It handles only single-digit numbers. For multi-digit, I'd be inclined to have a look at takewhile and/or dropwhile, both in itertools (Python 2.7 and 3.x only.)


    from string import maketrans

    class redux:

    def __init__(self):
    intab = '+-'
    outtab = ' ' # two spaces
    self.trantab = maketrans(intab, outtab)


    def reduce_plusminus(self, s):
    list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
    for r
    in s.translate(self.trantab).split()]
    return ''.join(list_form)

    if __name__ == "__main__":
    p = redux()
    print p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")
    print p.reduce_plusminus("tA.-2AG.-2AG,-2ag")

    for n in range(100000): p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")

    On Saturday, 5 January 2013 19:35:26 UTC+11, Sia wrote:
    > I have strings such as:
    >
    >
    >
    > tA.-2AG.-2AG,-2ag
    >
    > or
    >
    > .+3ACG.+5CAACG.+3ACG.+3ACG
    >
    >
    >
    > The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:
    >
    >
    >
    > tA..,
    >
    > and
    >
    > ...
    >
    >
    >
    > How can I do that?
    >
    > Thanks.
    Nick Mellor, Jan 7, 2013
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Brent White
    Replies:
    2
    Views:
    527
    Peter Bromberg [C# MVP]
    Jan 30, 2008
  2. Troy
    Replies:
    0
    Views:
    378
  3. Navin
    Replies:
    1
    Views:
    690
    Ken Schaefer
    Sep 9, 2003
  4. Dot Baiki

    Need someone who can help

    Dot Baiki, Sep 15, 2008, in forum: Ruby
    Replies:
    10
    Views:
    227
    Dot Baiki
    Sep 15, 2008
  5. Tyler
    Replies:
    7
    Views:
    550
    James Kuyper
    Apr 9, 2012
Loading...

Share This Page