Strange behavior

L

light1quark

Hi, I am migrating from PHP to Python and I am slightly confused.

I am making a function that takes a startingList, finds all the strings in the list that begin with 'x', removes those strings and puts them into a xOnlyList.

However if you run the code you will notice only one of the strings beginning with 'x' is removed from the startingList.
If I comment out 'startingList.remove(str);' the code runs with both strings beginning with 'x' being put in the xOnlyList.
Using the print statement I noticed that the second string that begins with 'x' isn't even identified by the function. Why does this happen?

def testFunc(startingList):
xOnlyList = [];
for str in startingList:
if (str[0] == 'x'):
print str;
xOnlyList.append(str)
startingList.remove(str) #this seems to be the problem
print xOnlyList;
print startingList
testFunc(['xasd', 'xjkl', 'sefwr', 'dfsews'])

#Thanks for your help!
 
A

Alain Ketterlin

However if you run the code you will notice only one of the strings
beginning with 'x' is removed from the startingList.
def testFunc(startingList):
xOnlyList = [];
for str in startingList:
if (str[0] == 'x'):
print str;
xOnlyList.append(str)
startingList.remove(str) #this seems to be the problem
print xOnlyList;
print startingList
testFunc(['xasd', 'xjkl', 'sefwr', 'dfsews'])

#Thanks for your help!

Try with ['xasd', 'sefwr', 'xjkl', 'dfsews'] and you'll understand what
happens. Also, have a look at:

http://docs.python.org/reference/compound_stmts.html#the-for-statement

You can't modify the list you're iterating on, better use another list
to collect the result.

-- Alain.

P/S: str is a builtin, you'd better avoid assigning to it.
 
T

Terry Reedy

However if you run the code you will notice only one of the strings
beginning with 'x' is removed from the startingList.
def testFunc(startingList):
xOnlyList = [];
for str in startingList:
if (str[0] == 'x'):
print str;
xOnlyList.append(str)
startingList.remove(str) #this seems to be the problem
print xOnlyList;
print startingList
testFunc(['xasd', 'xjkl', 'sefwr', 'dfsews'])

#Thanks for your help!

Try with ['xasd', 'sefwr', 'xjkl', 'dfsews'] and you'll understand what
happens. Also, have a look at:

http://docs.python.org/reference/compound_stmts.html#the-for-statement

You can't modify the list you're iterating on,

Except he obviously did ;-).
(Modifying set or dict raises SomeError.)

Indeed, people routine *replace* items while iterating.

def squarelist(lis):
for i, n in enumerate(lis):
lis = n*n
return lis

print(squarelist([0,1,2,3,4,5]))
# [0, 1, 4, 9, 16, 25]

Removals can be handled by iterating in reverse. This works even with
duplicates because if the item removed is not the one tested, the one
tested gets retested.

def removeodd(lis):
for n in reversed(lis):
if n % 2:
lis.remove(n)
print(n, lis)

ll = [0,1, 5, 5, 4, 5]
removeodd(ll)5 [0, 1, 5, 4, 5]
5 [0, 1, 4, 5]
5 [0, 1, 4]
4 [0, 1, 4]
1 [0, 4]
0 [0, 4]
better use another list to collect the result.

If there are very many removals, a new list will be faster, even if one
needs to copy the new list back into the original, as k removals from
len n list is O(k*n) versus O(n) for new list and copy.
P/S: str is a builtin, you'd better avoid assigning to it.

Agreed. People have actually posted code doing something like

....
list = [1,2,3]
....
z = list(x)
....
and wondered and asked why it does not work.
 
V

Virgil Stokes

Hi, I am migrating from PHP to Python and I am slightly confused.

I am making a function that takes a startingList, finds all the strings in the list that begin with 'x', removes those strings and puts them into a xOnlyList.

However if you run the code you will notice only one of the strings beginning with 'x' is removed from the startingList.
If I comment out 'startingList.remove(str);' the code runs with both strings beginning with 'x' being put in the xOnlyList.
Using the print statement I noticed that the second string that begins with 'x' isn't even identified by the function. Why does this happen?

def testFunc(startingList):
xOnlyList = [];
for str in startingList:
if (str[0] == 'x'):
print str;
xOnlyList.append(str)
startingList.remove(str) #this seems to be the problem
print xOnlyList;
print startingList
testFunc(['xasd', 'xjkl', 'sefwr', 'dfsews'])

#Thanks for your help!

You might find the following useful:

def testFunc(startingList):
xOnlyList = []; j = -1
for xl in startingList:
if (xl[0] == 'x'):
xOnlyList.append(xl)
else:
j += 1
startingList[j] = xl
if j == -1:
startingList = []
else:
del startingList[j:-1]

return(xOnlyList)


testList1 = ['xasd', 'xjkl', 'sefwr', 'dfsews']
testList2 = ['xasd', 'xjkl', 'xsefwr', 'xdfsews']
testList3 = ['xasd', 'jkl', 'sefwr', 'dfsews']
testList4 = ['asd', 'jkl', 'sefwr', 'dfsews']

xOnlyList = testFunc(testList1)
print 'xOnlyList = ',xOnlyList
print 'testList = ',testList1
xOnlyList = testFunc(testList2)
print 'xOnlyList = ',xOnlyList
print 'testList = ',testList2
xOnlyList = testFunc(testList3)
print 'xOnlyList = ',xOnlyList
print 'testList = ',testList3
xOnlyList = testFunc(testList4)
print 'xOnlyList = ',xOnlyList
print 'testList = ',testList4

And here is another version using list comprehension that I prefer

testList1 = ['xasd', 'xjkl', 'sefwr', 'dfsews']
testList2 = ['xasd', 'xjkl', 'xsefwr', 'xdfsews']
testList3 = ['xasd', 'jkl', 'sefwr', 'dfsews']
testList4 = ['asd', 'jkl', 'sefwr', 'dfsews']

def testFunc2(startingList):
return([x for x in startingList if x[0] == 'x'], [x for x in
startingList if x[0] != 'x'])

xOnlyList,testList = testFunc2(testList1)
print xOnlyList
print testList
xOnlyList,testList = testFunc2(testList2)
print xOnlyList
print testList
xOnlyList,testList = testFunc2(testList3)
print xOnlyList
print testList
xOnlyList,testList = testFunc2(testList4)
print xOnlyList
print testList
 
C

Chris Angelico

def testFunc(startingList):
xOnlyList = [];
for str in startingList:
if (str[0] == 'x'):
print str;
xOnlyList.append(str)
startingList.remove(str) #this seems to be the problem
print xOnlyList;
print startingList
testFunc(['xasd', 'xjkl', 'sefwr', 'dfsews'])

Other people have explained the problem with your code. I'll take this
example as a way of introducing you to one of Python's handy features
- it's an idea borrowed from functional languages, and is extremely
handy. It's called the "list comprehension", and can be looked up in
the docs under that name,

def testFunc(startingList):
xOnlyList = [strng for strng in startingList if strng[0] == 'x']
startingList = [strng for strng in startingList if strng[0] != 'x']
print(xOnlyList)
print(startingList)

It's a compact notation for building a list from another list. (Note
that I changed "str" to "strng" to avoid shadowing the built-in name
"str", as others suggested.)

(Unrelated side point: Putting parentheses around the print statements
makes them compatible with Python 3, in which 'print' is a function.
Unless something's binding you to Python 2, consider working with the
current version - Python 2 won't get any more features added to it any
more.)

Python's an awesome language. You may have to get your head around a
few new concepts as you shift thinking from PHP's, but it's well worth
while.

Chris Angelico
 
S

Steven D'Aprano

You might find the following useful:

def testFunc(startingList):
xOnlyList = []; j = -1
for xl in startingList:
if (xl[0] == 'x'):

That's going to fail in the starting list contains an empty string. Use
xl.startswith('x') instead.

xOnlyList.append(xl)
else:
j += 1
startingList[j] = xl

Very cunning, but I have to say that your algorithm fails the "is this
obviously correct without needing to study it?" test. Sometimes that is
unavoidable, but for something like this, there are simpler ways to solve
the same problem.

if j == -1:
startingList = []
else:
del startingList[j:-1]
return(xOnlyList)

And here is another version using list comprehension that I prefer
def testFunc2(startingList):
return([x for x in startingList if x[0] == 'x'], [x for x in
startingList if x[0] != 'x'])

This walks over the starting list twice, doing essentially the same thing
both times. It also fails to meet the stated requirement that
startingList is modified in place, by returning a new list instead.
Here's an example of what I mean:

py> mylist = mylist2 = ['a', 'x', 'b', 'xx', 'cx'] # two names for one
list
py> result, mylist = testFunc2(mylist)
py> mylist
['a', 'b', 'cx']
py> mylist2 # should be same as mylist
['a', 'x', 'b', 'xx', 'cx']

Here is the obvious algorithm for extracting and removing words starting
with 'x'. It walks the starting list only once, and modifies it in place.
The only trick needed is list slice assignment at the end.

def extract_x_words(words):
words_with_x = []
words_without_x = []
for word in words:
if word.startswith('x'):
words_with_x.append(word)
else:
words_without_x.append(word)
words[:] = words_without_x # slice assignment
return words_with_x


The only downside of this is that if the list of words is so enormous
that you can fit it in memory *once* but not *twice*, this may fail. But
the same applies to the list comprehension solution.
 
A

Alain Ketterlin

Chris Angelico said:
Other people have explained the problem with your code. I'll take this
example as a way of introducing you to one of Python's handy features
- it's an idea borrowed from functional languages, and is extremely
handy. It's called the "list comprehension", and can be looked up in
the docs under that name,

def testFunc(startingList):
xOnlyList = [strng for strng in startingList if strng[0] == 'x']
startingList = [strng for strng in startingList if strng[0] != 'x']
print(xOnlyList)
print(startingList)

It's a compact notation for building a list from another list. (Note
that I changed "str" to "strng" to avoid shadowing the built-in name
"str", as others suggested.)

Fully agree with you: list comprehension is, imo, the most useful
program construct ever. Extremely useful.

But not when it makes the program traverse twice the same list, where
one traversal is enough.

-- Alain.
 
A

Alain Ketterlin

I got my answer by reading your posts and referring to:
http://docs.python.org/reference/compound_stmts.html#the-for-statement
(particularly the shaded grey box)

Not that the problem is not specific to python (if you erase the current
element when traversing a STL list in C++ you'll get a crash as well).
I guess I should have (obviously) looked at the doc's before posting
here; but im a noob.

Python has several surprising features. I think it is a good idea to
take some time to read the language reference, from cover to cover
(before or after the various tutorials, depending on your background).

-- Alain.
 
P

Peter Otten

Virgil said:
def testFunc(startingList):
xOnlyList = []; j = -1
for xl in startingList:
if (xl[0] == 'x'):
That's going to fail in the starting list contains an empty string. Use
xl.startswith('x') instead.
Yes, but this was by design (tacitly assumed that startingList was both a
list and non-empty).

You missunderstood it will fail if the list contains an empty string, not if
the list itself is empty:
words = ["alpha", "", "xgamma"]
[word for word in words if word[0] == "x"]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range

The startswith() version:
[word for word in words if word.startswith("x")]
['xgamma']

Also possible:
[word for word in words if word[:1] == "x"]
['xgamma']

def testFunc1(startingList):
'''
Algorithm-1
Note:
One should check for an empty startingList before
calling testFunc1 -- If this possibility exists!
'''
return([x for x in startingList if x[0] == 'x'],
[x for x in startingList if x[0] != 'x'])


I would be interested in seeing code that is faster than algorithm-1

In pure Python? Perhaps the messy variant:

def test_func(words):
nox = []
append = nox.append
withx = [x for x in words if x[0] == 'x' or append(x)]
return withx, nox
 
S

Steven D'Aprano

You might find the following useful:

def testFunc(startingList):
xOnlyList = []; j = -1
for xl in startingList:
if (xl[0] == 'x'):
That's going to fail in the starting list contains an empty string. Use
xl.startswith('x') instead.

Yes, but this was by design (tacitly assumed that startingList was both
a list and non-empty).

As Peter already pointed out, I said it would fail if the list contains
an empty string, not if the list was empty.

xOnlyList.append(xl)
else:
j += 1
startingList[j] = xl

Very cunning, but I have to say that your algorithm fails the "is this
obviously correct without needing to study it?" test. Sometimes that is
unavoidable, but for something like this, there are simpler ways to
solve the same problem.

Sorry, but I do not sure what you mean here.

In a perfect world, you should be able to look at a piece of code, read
it once, and see whether or not it is correct. That is what I mean by
"obviously correct". For example, if I have a function that takes an
argument, doubles it, and prints the result:

def f1(x):
print(2*x)


that is obviously correct. Whereas this is not:

def f2(x):
y = (x + 5)**2 - (x + 4)**2
sys.stdout.write(str(y - 9) + '\n')


because you have to study it to see whether or not it works correctly.

Not all programs are simple enough to be obviously correct. Sometimes you
have no choice but to write something which requires cleverness to get
the right result. But this is not one of those cases. You should almost
always prefer simple code over clever code, because the greatest expense
in programming (time, effort and money) is to make code correct.

Most code does not need to be fast. But all code needs to be correct.


[...]
This can meet the requirement that startingList is modified in place via
the call to this function (see the attached code).

Good grief! See, that's exactly the sort of thing I'm talking about.
Without *detailed* study of your attached code, how can I possibly know
what it does or whether it does it correctly?

Your timing code calculates the mean using a recursive algorithm. Why
don't you calculate the mean the standard way: add the numbers and divide
by the total? What benefit do you gain from a more complicated algorithm
when a simple one will do the job just as well?

You have spent a lot of effort creating a complicated, non-obvious piece
of timing code, with different random seeds for each run, and complicated
ways of calculating timing statistics... but unfortunately the most
important part of any timing test, the actually *timing*, is not done
correctly. Consequently, your code is not correct.

With an average time of a fraction of a second, none of those timing
results are trustworthy, because they are vulnerable to interference from
other processes, the operating system, and other random noise. You spend
a lot of time processing the timing results, but it is Garbage In,
Garbage Out -- the results are not trustworthy, and if they are correct,
it is only by accident.

Later in your post, you run some tests, and are surprised by the result:
Why is algorithm-2A slower than algorithm-2?

It isn't slower. It is physically impossible, since 2A does *less* work
than 2. This demonstrates that you are actually taking a noisy
measurement: the values you get have random noise, and you don't make any
effort to minimise that noise. Hence GIGO.

The right way to test small code snippets is with the timeit module. It
is carefully written to overcome as much random noise as possible. But
even there, the authors of the timeit module are very clear that you
should not try to calculate means, let alone higher order statistics like
standard deviation. The only statistic which is trustworthy is to run as
many trials as you can afford, and select the minimum value.

So here is my timing code, which is much shorter and simpler and doesn't
try to do too much. You do need to understand the timeit.Timer class:

timeit.Timer creates a timer object; timer.repeat does the actual timing.
The specific arguments to them are not vital to understand, but you can
read the documentation if you wish to find out what they mean.

First, I define the two functions. I compare similar functions that have
the same effect. Neither modifies the input argument in place. Copy and
paste the following block into an interactive interpreter:

# Start block

def f1(startingList):
return ([x for x in startingList if x[0] == 'x'],
[x for x in startingList if x[0] != 'x'])

# Note that the above function is INCORRECT, it will fail if a string is
# empty; nevertheless I will use it for timing purposes anyway.


def f2(startingList):
words_without_x = []
words_with_x = []
for word in startingList:
if word.startswith('x'):
words_with_x.append(word)
else:
words_without_x.append(word)
return (words_with_x, words_without_x)

# Set up some test data. There's no point being too clever about this.
# Keep it simple.

import random
data = ['aa', 'bb', 'cb', 'xa', 'xb', 'xc']*1000000
random.shuffle(data)

# Set up two timers.
from timeit import Timer
setup = "from __main__ import data, f1, f2"
t1 = Timer("a, b = f1(data)", setup)
t2 = Timer("a, b = f2(data)", setup)

# and run the timers
best1 = min(t1.repeat(number=1, repeat=10))
best2 = min(t2.repeat(number=1, repeat=10))

# End block


On my computer, here are the results. Yours may differ.

best1: 3.5199968814849854
best2: 3.515479803085327

No significant difference. And that is to be expected: the bulk of the
time is spent building up two lists of three million items each.

So let's run it again with less data:

data = data[:10000]
best1 = min(t1.repeat(number=200, repeat=10))/200
best2 = min(t2.repeat(number=200, repeat=10))/200

which gives results:

best1: 0.0037816047668457033
best2: 0.005841898918151856

The double list comp solution is faster, but it's also incorrect -- it
fails if there is an empty string in the list. What happens if we replace
it with a version that doesn't have the empty string bug?

def f1(startingList):
return ([x for x in startingList if x.startswith('x')],
[x for x in startingList if not x.startswith('x')])

best1 = min(t1.repeat(number=200, repeat=10))/200
best2 = min(t2.repeat(number=200, repeat=10))/200


which gives these results:

best1: 0.008604295253753662
best2: 0.005863149166107178


So there's the first lesson: it's easy to be fast if you don't mind
writing buggy code.

Can we do better? Try this:


def f3(startingList):
words_with_x = []
words_without_x = []
append_with = words_with_x.append
append_without = words_without_x.append
for word in iter(startingList):
if word[:1] == 'x':
append_with(word)
else:
append_without(word)
return (words_with_x, words_without_x)

t3 = Timer('a, b = f3(data)', 'from __main__ import f3, data')
best3 = min(t3.repeat(number=200, repeat=10))/200

And the result:

best3: 0.0033271098136901855


which is even faster than your original version.

Or is it? No, I can't conclude that. The difference between the original
f1 function (0.00378s) and my f3 function (0.00332s) is too small to be
sure it is real from just ten trials of each. A better statistician than
me could probably estimate the number of trials needed to be confident
that one is better than the other.

But then, with a difference that small, who cares? In the real world, a
difference that small is lost in the noise. Because of the noise,
probably 50% of the time the slower code will finish first.


[...]
Suppose words was not a list --- you have tacitly assumed that words is
a list.

Actually, no I have not. I have assumed it is an iterable object, such as
a list, a tuple, or an iterator. So what? You have done the same thing.
Doing an isinstance type check at the beginning of both functions will
just slow them both down by the same amount.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top