find sublist inside list

  • Thread starter Matthias Gallé
  • Start date
M

Matthias Gallé

Hi.

My problem is to replace all occurrences of a sublist with a new element.

Example:
Given ['a','c','a','c','c','g','a','c'] I want to replace all
occurrences of ['a','c'] by 6 (result [6,6,'c','g',6]).

If I do this with string ('acaccgac') I have the advantage of all the
'find' functions, but perfomance is bad and some extra care must be
taken if one element consist of more then one character (case of 11 for
example)

So I really would like to work with lists straightforward, but I could
not found anything to search a sublist inside a list.
Any propositions for a simple solution?

Thanks in advance,


--
Matthias Gallé
Project Symbiose
Centre de Recherche INRIA Rennes - Bretagne Atlantique,
Campus de Beaulieu, 35042 Rennes cedex, France
tel: (33|0) 2 9984 7523
http://www.irisa.fr/symbiose/matthias_galle
 
B

bearophileHUGS

Matthias Gallé:
My problem is to replace all occurrences of a sublist with a new element.
Example:
Given ['a','c','a','c','c','g','a','c'] I want to replace all
occurrences of ['a','c'] by 6 (result [6,6,'c','g',6]).

There are several ways to solve this problem. Representing a string as
a list of "chars" (1-strings) is fine if you work with small strings,
but once they get longer you need too much memory and time.

This is a starting point:
mapper = {"a":6, "c":6}
data = 'acaccgac'
[mapper.get(c, c) for c in data]
[6, 6, 6, 6, 6, 'g', 6, 6]

If you need higher performance, to represent few small numbers into a
well defined string like genomic data, you can use ASCII values of 6
and 11:
from string import maketrans
tab = maketrans("acgt", "".join([chr(6), chr(6), "gt"]))
s.translate(tab)
'\x06\x06\x06\x06\x06g\x06\x06'

Later in processing it's easy to tell apart normal genome bases from
those small numbers.

Note that there is the array.array function too in the standard lib,
and elsewhere there is Numpy too.

There are several other possible solutions, but I stop here so you can
explain your purposes and constraints better.

Bye,
bearophile
 
J

John O'Hagan

Hi.

My problem is to replace all occurrences of a sublist with a new element.

Example:
Given ['a','c','a','c','c','g','a','c'] I want to replace all
occurrences of ['a','c'] by 6 (result [6,6,'c','g',6]).

li=['a', 'c', 'a', 'c', 'c', 'g', 'a', 'c']
for i in range(len(li)):
if li[i:i + 2] == ['a', 'c']:
li[i:i + 2] = ['6']

HTH,

John
 
B

bearophileHUGS

John O'Hagan:
li=['a', 'c', 'a', 'c', 'c', 'g', 'a', 'c']
for i  in range(len(li)):
    if li[i:i + 2] == ['a', 'c']:
        li[i:i + 2] = ['6']

Oh well, I have done a mistake, it seems.
Another solution then:
'\x06\x06cg\x06'

Bye,
bearophile
 
M

Matthias Gallé

John O'Hagan:
li=['a', 'c', 'a', 'c', 'c', 'g', 'a', 'c']
for i in range(len(li)):
if li[i:i + 2] == ['a', 'c']:
li[i:i + 2] = ['6']

Oh well, I have done a mistake, it seems.
Another solution then:
'\x06\x06cg\x06'

Bye,
bearophile

Thanks bearophile and John for your quick answers.
Unfortunately, the int that can replace a sublist can be > 255, but
John's answer looks simple and good enough for me. I will use it as a
starting point.

Thank's again.

--
Matthias Gallé
Project Symbiose
Centre de Recherche INRIA Rennes - Bretagne Atlantique,
Campus de Beaulieu, 35042 Rennes cedex, France
tel: (33|0) 2 9984 7523
http://www.irisa.fr/symbiose/matthias_galle
 
B

bearophileHUGS

Matthias Gallé:
the int that can replace a sublist can be > 255,<

You didn't specify your integer ranges.
Probably there are many other solutions for your problem, but you have
to give more information. Like the typical array size, typical range
of the numbers, how much important is total memory used, how much
important is running speed, what kind of processing (or serialization/
output) you later have to do with such arrays, and so on.
Other solutions include using an array('H', []), and using 0-255 to
represent ASCII and numbers >255 <2^16 to represent the other numbers,
etc.
If speed is critical you can even think about creating a little
function with PyInline or D+Pyd, etc.

Bye,
bearophile
 
M

MRAB

Matthias said:
John O'Hagan:
li=['a', 'c', 'a', 'c', 'c', 'g', 'a', 'c']
for i in range(len(li)):
if li[i:i + 2] == ['a', 'c']:
li[i:i + 2] = ['6']

Oh well, I have done a mistake, it seems.
Another solution then:
'acaccgac'.replace("ac", chr(6))
'\x06\x06cg\x06'

Bye,
bearophile

Thanks bearophile and John for your quick answers.
Unfortunately, the int that can replace a sublist can be > 255, but
John's answer looks simple and good enough for me. I will use it as a
starting point.
John's solution changes the length of the list over which it's
iterating.

I'd suggest something more like:

li = ['a', 'c', 'a', 'c', 'c', 'g', 'a', 'c']
pos = 0
try:
while True:
pos = li.index('a', pos)
if li[pos : pos + 2] == ['a', 'c']:
li[pos : pos + 2] = [6]
pos += 1
except ValueError:
pass
 
A

Aahz

My problem is to replace all occurrences of a sublist with a new element.

Example:
Given ['a','c','a','c','c','g','a','c'] I want to replace all
occurrences of ['a','c'] by 6 (result [6,6,'c','g',6]).

What's your goal? After you do this once, you cannot repeat the
operation with a different sublist because you are not tracking the
source of the numbers. You might look into standard compression
algorithms for information about how to accomplish this.
 
J

John O'Hagan

Hi.

My problem is to replace all occurrences of a sublist with a new
element.

Example:
Given ['a','c','a','c','c','g','a','c'] I want to replace all
occurrences of ['a','c'] by 6 (result [6,6,'c','g',6]).

li=['a', 'c', 'a', 'c', 'c', 'g', 'a', 'c']
for i in range(len(li)):
if li[i:i + 2] == ['a', 'c']:
li[i:i + 2] = ['6']

HTH,

John

Beware that you are mutating the list you are iterating over. That could
lead to some strange bugs (for instance if you replaced the deleted items
with a longer sequence, the range(len(li)) would still go up to the
original lenght).
It is better to modify a new list instead. Eg you could append to a new
list.
[...]

Quite right, while it happens to work in this particular example, as you and
MRAB point out, it's generally dangerous (and in fact this one silently and
uselessly iterates over the last couple of indexes which no longer exist); a
new list could be created like this:

index=0
newli=[]
while index<len(li):
if li[index:index+2]==['a', 'c']:
newli.append(6)
index += 2
else:
newli.append(li[index])
index += 1

Regards,

John
 
M

mzdude

John O'Hagan:
li=['a', 'c', 'a', 'c', 'c', 'g', 'a', 'c']
for i  in range(len(li)):
    if li[i:i + 2] == ['a', 'c']:
        li[i:i + 2] = ['6']
Oh well, I have done a mistake, it seems.
Another solution then:

Bye,
bearophile

Thanks bearophile and John for your quick answers.
Unfortunately, the int that can replace a sublist can be > 255, but
John's answer looks simple and good enough for me. I will use it as a
starting point.

substring isn't limited to 0..255'\x00x257\x00x257\x00x257\x00x257cg\x00x257\x00x257'
 
G

Gabriel Genellina

substring isn't limited to 0..255
'\x00x257\x00x257\x00x257\x00x257cg\x00x257\x00x257'

This isn't what you think it is. Look carefully:

py> substring = "\0x%d\0x%d" % (257,257)
py> len(substring)
10
py> list(substring)
['\x00', 'x', '2', '5', '7', '\x00', 'x', '2', '5', '7']
 
T

Terry Reedy

Matthias said:
Hi.

My problem is to replace all occurrences of a sublist with a new element.

Example:
Given ['a','c','a','c','c','g','a','c'] I want to replace all
occurrences of ['a','c'] by 6 (result [6,6,'c','g',6]).

If I do this with string ('acaccgac') I have the advantage of all the
'find' functions, but perfomance is bad and some extra care must be
taken if one element consist of more then one character (case of 11 for
example)

So I really would like to work with lists straightforward, but I could
not found anything to search a sublist inside a list.
Any propositions for a simple solution?

For a mutable homogenous array, consider the array module.
Any algorithm that applies to a sequence of chars can be adjusted to
other sequences. For the above case, remember than you can easily
filter None out of a sequence. IE, replace 'a','c' with 6, None and
then filter when done.
 
G

Gerard Flanagan

Matthias said:
Hi.

My problem is to replace all occurrences of a sublist with a new element.

Example:
Given ['a','c','a','c','c','g','a','c'] I want to replace all
occurrences of ['a','c'] by 6 (result [6,6,'c','g',6]).

For novelty value:

from itertools import izip

def replace2(data, pattern):
assert len(pattern) == 2
pattern = tuple(pattern)
icopy = iter(data)
icopy.next()
gen = izip(data, icopy)
while True:
item = gen.next()
if item == pattern:
yield '6'
gen.next()
else:
yield item[0]

# works if list ends with ['a', 'c']
data = ['g', 'a', 'c', 'a', 'c', 'a', 'a', 'a', 'g', 'a', 'c']
want = 'g66aaag6'
assert ''.join(replace2(data, ['a', 'c'])) == want

# otherwise you lose the last element of the tail
data = ['g', 'a', 'c', 'a', 'c', 'a', 'a', 'a', 'g', 'a', 'c', 'c', 'g']
want = 'g66aaag6cg'
get = 'g66aaag6c'
assert not ''.join(replace2(data, ['a', 'c'])) == want
assert ''.join(replace2(data, ['a', 'c'])) == get

# fix by adding the pattern to the end of the data as a sentinel

def replace2(data, pattern):
assert len(pattern) == 2
def _replace2(data, pattern):
pattern = tuple(pattern)
icopy = iter(data)
icopy.next()
gen = izip(data, icopy)
while True:
item = gen.next()
if item == pattern:
yield '6'
gen.next()
else:
yield item[0]
data = data + pattern
return list(_replace2(data, pattern))[:-1]

data = ['g', 'a', 'c', 'a', 'c', 'a', 'a', 'a', 'g', 'a', 'c']
want = 'g66aaag6'
assert ''.join(replace2(data, ['a', 'c'])) == want

data = ['g', 'a', 'c', 'a', 'c', 'a', 'a', 'a', 'g', 'a', 'c', 'c', 'g']
want = 'g66aaag6cg'
assert ''.join(replace2(data, ['a', 'c'])) == want

print 'done'
 
M

mzdude

substring isn't limited to 0..255
'\x00x257\x00x257\x00x257\x00x257cg\x00x257\x00x257'

This isn't what you think it is. Look carefully:

py> substring = "\0x%d\0x%d" % (257,257)
py> len(substring)
10
py> list(substring)
['\x00', 'x', '2', '5', '7', '\x00', 'x', '2', '5', '7']

OOPS. My bad. But I'm not going to give up.

l = ['a','b','c','a','c']
us = unicode("".join(l))
substr = unichr(257) + unichr(257)
us = us.replace(u'ac',substr)
print len(us)
print list(us)


output is5
[u'a', u'b', u'c', u'\u0101', u'\u0101']
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top