Python and STL efficiency


Amanjit Gill

Licheng said:
Hi, I'm learning STL and I wrote some simple code to compare the
efficiency of Python and the STL.

Hi, you are benchmarking heap allocations, and of course heap fragmentation.
This is what the DevPartner C++ profiler had to say:

Method            % in Method  % with Children   Called    Average         Real
--------------------------------------------------------------------------------
RtlFreeHeap           52.8          52.8         81,034        7.6          7.7
RtlAllocateHeap       19.8          19.8         81,089        2.9          2.9
main                  16.9          89.7              1  198,083.2  1,052,376.0
ExitProcess           10.3          10.3              1  120,616.8    120,641.3
...

On Linux it's a lot better than that because, I think, ptmalloc is
used, which is based on dlmalloc and is efficient for those small-size
allocations (which malloc() and free() were never designed for).

On Win32, activate the low-fragmentation heap or write a custom STL
allocator:

_set_sbh_threshold(128);                        // method 1: small-block heap

ULONG HeapFragValue = 2;                        // method 2: low-fragmentation heap
HeapSetInformation((HANDLE)_get_heap_handle(),
                   HeapCompatibilityInformation,
                   &HeapFragValue,
                   sizeof(HeapFragValue));

last non-python message from me :)
 

Mc Osten

Tim N. van der Leeuw said:
I have to admit to a stupid mistake, for which I feel quite ashamed - I
got the loop-size wrong in the Python code. So all Python results
posted by me were off by a factor of 10 :-(
I feel quite bad about that!

Well, this makes *my* results quite surprising.
I checked it three times. I loop 1000000 times in both Python and C++,
and Python here is faster.
 

Mc Osten

Ray said:
Great to know that my model of how the world works is still correct!
(at least in relation to Python and C++!) :)

So please explain my results. I loop the same number of times.
 

Neil Cerutti

So please explain my results. I loop the same number of times.

Those of you experiencing a temporary obsession with this topic
are encouraged to study The Great Language Shootout, until the
obsession goes away. ;)

Your time might not be totally wasted; an opportunity to improve
the Python solutions may present itself.

http://shootout.alioth.debian.org/
 

skip

Amanjit> So on linux its a lot better than that, because - I think -
Amanjit> ptmalloc is used which is based on dlmalloc which is efficient
Amanjit> for those small size allocs (which malloc() and free() were
Amanjit> never designed for).

And which for Python has been further optimized in obmalloc. Again, Python
wins. ;-)

Skip
 

Mc Osten

Neil Cerutti said:
Those of you experiencing a temporary obsession with this topic
are encouraged to study The Great Language Shootout, until the
obsession goes away. ;)

I think that everybody knows the GLS. However, when I get results
different from what I expected, I try to understand where I made the
wrong assumption.

However, a recent post kind of explains what the problem is.
 

JSprenkle

Licheng said:
Hi, I'm learning STL and I wrote some simple code to compare the
efficiency of python and STL.

//C++
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <set>
#include <algorithm>
using namespace std;

int main() {
    vector<string> a;
    for (long int i = 0; i < 10000; ++i) {
        a.push_back("What do you know?");
        a.push_back("so long...");
        a.push_back("chicken crosses road");
        a.push_back("fool");
    }
    set<string> b(a.begin(), a.end());
    unique_copy(b.begin(), b.end(), ostream_iterator<string>(cout, "\n"));
}

I think you probably want this C++ code instead:

//C++
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <set>
#include <algorithm>
using namespace std;

int main() {
    vector<string> a;
    a.reserve(40000);  // <------------------ note this change
    for (long int i = 0; i < 10000; ++i) {
        a.push_back("What do you know?");
        a.push_back("so long...");
        a.push_back("chicken crosses road");
        a.push_back("fool");
    }
    set<string> b(a.begin(), a.end());
    unique_copy(b.begin(), b.end(), ostream_iterator<string>(cout, "\n"));
}

It will run a lot faster if it doesn't have to keep resizing the array.
 

Neil Cerutti

I think you probably want this C++ code instead:

//C++
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <set>
#include <algorithm>
using namespace std;

int main() {
    vector<string> a;
    a.reserve(40000);  // <------------------ note this change
    for (long int i = 0; i < 10000; ++i) {
        a.push_back("What do you know?");
        a.push_back("so long...");
        a.push_back("chicken crosses road");
        a.push_back("fool");
    }
    set<string> b(a.begin(), a.end());
    unique_copy(b.begin(), b.end(), ostream_iterator<string>(cout, "\n"));
}

It will run a lot faster if it doesn't have to keep resizing
the array.

I don't see why it should run a lot faster that way.

Appending elements to a vector with push_back takes amortized
constant time. In the example above, preallocating 40000 strings
saves (probably) math.log(40000, 2) reallocations of the vector's
storage along with the consequent copy-construction and
destruction of some fixed number of strings. Most of those
reallocations take place while the vector is still proportionally
quite small.
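Those numbers are easy to check by simulating a doubling vector. A minimal sketch (Python 3 syntax, unlike the thread's Python 2; the function name and the starting capacity of 1 are my own assumptions, not anything from the thread):

```python
import math

def growth_stats(n, initial_capacity=1):
    """Model a vector that doubles its capacity when full: count the
    reallocations and element copies needed to push_back n items."""
    capacity = initial_capacity
    size = 0
    reallocs = 0
    copies = 0
    for _ in range(n):
        if size == capacity:
            capacity *= 2   # geometric growth
            reallocs += 1
            copies += size  # every live element is copied to the new storage
        size += 1
    return reallocs, copies

# For 40000 elements: about math.log(40000, 2) ~= 15.3 reallocations,
# and fewer than 2 * 40000 element copies in total.
print(growth_stats(40000))  # -> (16, 65535)
```

So the model agrees with both posts: roughly log2(n) reallocations, and a total copy count bounded by about 2n.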
 

Ben Sizer

Neil said:
I don't see why it should run a lot faster that way.

Appending elements to a vector with push_back takes amortized
constant time. In the example above, preallocating 40000 strings
saves (probably) math.log(40000, 2) reallocations of the vector's
storage along with the consequent copy-construction and
destruction of some fixed number of strings. Most of those
reallocations take place while the vector is still proportionally
quite small.

Math.log(40000, 2) is not a small number when talking about a
relatively expensive operation such as memory allocation and
deallocation. And the superfluous copying amounts to probably an extra
2^16 copies in this case - not exactly trivial.
 

Ray

Neil said:
I don't see why it should run a lot faster that way.

Appending elements to a vector with push_back takes amortized
constant time. In the example above, preallocating 40000 strings
saves (probably) math.log(40000, 2) reallocations of the vector's
storage along with the consequent copy-construction and
destruction of some fixed number of strings. Most of those
reallocations take place while the vector is still proportionally
quite small.

I see your logic; however, it does make it run a lot faster. On my
machine, with 1,000,000 repetitions, what used to take 5+ seconds now
takes only about 1 to 2 seconds.
 

Neil Cerutti

Math.log(40000, 2) is not a small number when talking about a
relatively expensive operation such as memory allocation and
deallocation. And the superfluous copying amounts to probably
an extra 2^16 copies in this case - not exactly trivial.

Actually, I misread the suggestion.

I thought I saw:

vector<string> v;
v.reserve(40000);

instead of:

vector<string> v(40000);

The latter actually *slows* the program, according to my tests. I
think the poster meant to suggest the former, and that's what I
was discussing.

I increased the number of iterations from 10000 to 100000 for the
tests I'm reporting.

The best of five, with reserve, was 2363 clock ticks. Without
reserve, it was 3244. The suggestion to use an initial size
resulted in 3655 clocks.

So it appears you were more right (using reserve sped up the
program by 35%, just about as the numbers predict), although we
were both wrong. ;)
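Python has no reserve(), but the same comparison can be approximated with timeit: build the list by append, or preallocate it to full size (the Python analogue of the vector<string>(40000) variant) and fill it by index. A hypothetical sketch (Python 3 syntax, not the code anyone in the thread ran):

```python
import timeit

STRINGS = ["What do you know?", "so long...", "chicken crosses road", "fool"]

def with_append(n=10000):
    a = []
    for _ in range(n):
        for s in STRINGS:
            a.append(s)
    return a

def prefilled(n=10000):
    # Analogue of vector<string>(40000): the list already holds 4*n
    # slots, so no resizing happens while filling.
    a = [None] * (4 * n)
    j = 0
    for _ in range(n):
        for s in STRINGS:
            a[j] = s
            j += 1
    return a

print(timeit.timeit(with_append, number=20))
print(timeit.timeit(prefilled, number=20))
```

Both functions produce the same list, so any timing gap is purely the cost of Python's list growth strategy versus index assignment.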
 

Pebblestone

I tested in Common Lisp and compared the result with Python.

My PC is a 3.6 GHz Pentium 4.
My OS is Ubuntu 6.06 i386.
My Lisp implementation is SBCL 0.9.8 for i386.
My Python version is 2.4.3.
Both implementations were installed via apt-get install.

Here's my Lisp program:

++++++++++++++++++++++++++++++++++++++++++++++++++
;;; from slow to fast implementation

(proclaim '(optimize (speed 3) (debug 0) (safety 0) (space 0)))

(defun test1 ()
  (let ((a (make-array 0 :adjustable t :fill-pointer 0))
        (b nil))
    (dotimes (i 1000000)
      (progn
        (vector-push-extend "What do you know" a)
        (vector-push-extend "so long ..." a)
        (vector-push-extend "chicken crosses road" a)
        (vector-push-extend "fool" a)))
    (setf b (remove-duplicates a :test #'string-equal))
    (map 'vector #'print b)))

(defun test2 ()
  (let ((a (make-array 0 :adjustable t :fill-pointer 0))
        (b nil))
    (dotimes (i 1000000)
      (progn
        (vector-push-extend "What do you know" a)
        (vector-push-extend "so long ..." a)
        (vector-push-extend "chicken crosses road" a)
        (vector-push-extend "fool" a)))
    (setf b (remove-duplicates a :test #'eq))
    (map 'vector #'print b)))

(defun test3 ()
  (let ((a (make-array 4000000 :element-type 'string
                               :adjustable nil
                               :fill-pointer 0))
        (b nil))
    (dotimes (i 1000000)
      (progn
        (vector-push "What do you know" a)
        (vector-push "so long ..." a)
        (vector-push "chicken crosses road" a)
        (vector-push "fool" a)))
    (setf b (remove-duplicates a))
    (map 'vector #'print b)))

(defun test4 ()
  (let ((a (make-array 4000000 :element-type 'string
                               :adjustable nil))
        (b nil))
    (dotimes (i 1000000)
      (progn
        (let ((j (1- (* 4 i))))
          (setf (aref a (incf j)) "What do you know")
          (setf (aref a (incf j)) "so long ...")
          (setf (aref a (incf j)) "chicken crosses road")
          (setf (aref a (incf j)) "fool"))))
    (setf b (remove-duplicates a))
    (map 'vector #'print b)))

+++++++++++++++++++++++++++++++++++++++++++++++++++++++

1. When using the string-equal comparator (test #1), the speed slows
down dramatically: 1.93 s -> 9.35 s.
2. It seems that appending (vector-push-extend) in Lisp costs much more
than direct access via index.

Here's my benchmark result:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++
#1
(time (test1))

"What do you know"
"so long ..."
"chicken crosses road"
"fool"
Evaluation took:
9.346 seconds of real time
9.020564 seconds of user run time
0.328021 seconds of system run time
0 page faults and
457,565,688 bytes consed.


#2
(time (test2))

"What do you know"
"so long ..."
"chicken crosses road"
"fool"
Evaluation took:
1.931 seconds of real time
1.884118 seconds of user run time
0.048003 seconds of system run time
0 page faults and
49,561,312 bytes consed.


#3
(time (test3))

"What do you know"
"so long ..."
"chicken crosses road"
"fool"
Evaluation took:
1.721 seconds of real time
1.704107 seconds of user run time
0.016001 seconds of system run time
0 page faults and
32,012,040 bytes consed.


#4
(time (test4))

"What do you know"
"so long ..."
"chicken crosses road"
"fool"
Evaluation took:
0.843 seconds of real time
0.824051 seconds of user run time
0.020001 seconds of system run time
0 page faults and
32,012,040 bytes consed.

+++++++++++++++++++++++++++++++++++++++++++++++++++
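The comparator makes a similar difference outside Lisp. Python's set() deduplicates by hash and equality, and CPython caches each str object's hash, so deduplicating 4,000,000 references to 4 distinct strings touches each full string only a handful of times. A small sketch of that effect (Python 3 syntax; my own example, not from the thread):

```python
import timeit

strings = ['What do you know', 'so long ...', 'chicken crosses road', 'fool']
a = strings * 1000000  # 4,000,000 references to 4 string objects

# set() is the Python analogue of remove-duplicates: hash each element,
# compare for equality only on a hash collision.
print(sorted(set(a)))
print(timeit.timeit(lambda: set(a), number=1))
```

This is closer in spirit to test #2 (identity-cheap comparison) than to test #1, which compares string contents pairwise.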

Here's my python's code:

def f():
    a = []
    for i in range(1000000):
        a.append('What do you know')
        a.append('so long...')
        a.append('chicken crosses road')
        a.append('fool')
    b = set(a)
    for s in b:
        print s

from time import clock

f_start = clock()
f()
f_end = clock()

print "Elapsed: %f seconds" % (f_end - f_start)



And benchmark result:

python -O test.py
so long...
What do you know
fool
chicken crosses road
Elapsed: 1.260000 seconds


Oops, Python beat native-compiled lisp code.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Anyway,

MY CONCLUSION is: C++ is not the right model for symbolic computation.
 

bearophileHUGS

Pebblestone:
(defun test4 ()
  (let ((a (make-array 4000000 :element-type 'string
                               :adjustable nil))
        (b nil))
    (dotimes (i 1000000)
      (progn
        (let ((j (1- (* 4 i))))
          (setf (aref a (incf j)) "What do you know")
          (setf (aref a (incf j)) "so long ...")
          (setf (aref a (incf j)) "chicken crosses road")
          (setf (aref a (incf j)) "fool"))))
    (setf b (remove-duplicates a))
    (map 'vector #'print b)))


That test4 function can be compared to this one, with explicit
preallocation (and xrange instead of range!):

def f2():
    n = 1000000
    a = [None] * n * 4
    for i in xrange(0, n*4, 4):
        a[i] = 'What do you know'
        a[i+1] = 'so long...'
        a[i+2] = 'chicken crosses road'
        a[i+3] = 'fool'
    for s in set(a):
        print s

But note that this Python list (which is actually an array; a linked
list is a different data structure) becomes filled with pointers. I
don't know what your CL does exactly.
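The "filled with pointers" point can be seen from Python itself: list slots hold references, so the big list f2 builds contains only four distinct string objects no matter how many slots it has. A minimal sketch (Python 3 syntax; my example, and id() identity is a CPython detail):

```python
strings = ['What do you know', 'so long...', 'chicken crosses road', 'fool']
a = strings * 1000000  # 4,000,000 slots, but no string data is copied

# id() exposes object identity in CPython: every slot refers to one of
# the same four string objects.
distinct_objects = {id(s) for s in a}
print(len(distinct_objects))  # -> 4
```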

I can also suggest using Psyco here
(http://psyco.sourceforge.net/):

import psyco
psyco.bind(f2)

It makes f2 more than twice as fast here.

Bye,
bearophile
 

Pebblestone

But note that this Python list (which is actually an array; a linked
list is a different data structure) becomes filled with pointers. I
don't know what your CL does exactly.

I heard that Python's list is implemented as an adjustable array.

Here's my lisp implementation:

++++++++++++++++++
(defun test-list ()
  (let ((a nil)
        (b nil))
    (dotimes (i 1000000)
      (progn
        (push "What do you know" a)
        (push "so long ..." a)
        (push "chicken crosses road" a)
        (push "fool" a)))
    (setf b (remove-duplicates a))
    (map 'list #'print b)))
+++++++++++++++++++++++++

And the benchmark result:

(time (test-list))

"fool"
"chicken crosses road"
"so long ..."
"What do you know"
Evaluation took:
2.88 seconds of real time
2.744172 seconds of user run time
0.136009 seconds of system run time
0 page faults and
74,540,392 bytes consed.

++++++++++++++++++++++++++++++

BTW, I couldn't install Psyco on my system (Ubuntu); gcc just printed
thousands of lines of errors and warnings.

Pebblestone

Oh, I forgot.

Your Python example (using direct array indexing), corresponding to my
Lisp code, runs slower than the version which uses 'append'.

This makes me wonder how Python's list is implemented.

Anyway, Python's list is surprisingly efficient.



Pebblestone

Here's the result:

What do you know
fool
chicken crosses road
f elapsed: 1.260000 seconds
f2 elapsed: 2.110000 seconds




bearophileHUGS

Pebblestone:
I heard that python's list is implemented as adjustable array.

Correct, an array geometrically adjustable on the right.
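That geometric over-allocation can be observed directly in CPython with sys.getsizeof, which reports the list object plus its pointer array. The size jumps get sparser as the list grows (a sketch; the exact growth pattern varies by CPython version):

```python
import sys

a = []
last = sys.getsizeof(a)
jumps = []  # (length, new size in bytes) at each reallocation
for _ in range(1000):
    a.append(None)
    size = sys.getsizeof(a)
    if size != last:  # capacity grew: a reallocation happened here
        jumps.append((len(a), size))
        last = size
print(jumps)  # far fewer jumps than appends
```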

Here's my lisp implementation:<

What's the memory size of a before computing b? You can compare it
with Python, which may need less memory (because the array contains
pointers).

BTW, I couldn't install psyco on my system (ubuntu), gcc just prompt to me thousands of lines of errors and warnings.<

Find a Win box ;-) It's already compiled for it (for Py 2.3, 2.4).

Your python's example (use direct index array index) of my corresponding lisp code works slower than the version which use 'append'.<

For me (a slow PC) it's almost twice as fast; computer life is usually
complex.
For me, using the explicit allocation + Psyco makes that program about
4 times faster (from 8 to 2 seconds).

This let me think how python's list is implemented.<

You also have to think about how the * allocation is implemented, and
many other things :)
The list implementation is rather readable; the Python sources are
online too.

Anyway, python's list is surprisingly efficient.<

But its access isn't that fast :) Psyco helps.

Bye,
bearophile
 

Pebblestone

What's the memory size of a before computing b? You can compare it with
Python, that may need less memory (because the array contains
pointers).



Here's the memory usage:

1) before the loop ( fully garbage collected)
29,052,560 bytes, 757,774 objects.

2) after the loop
103,631,952 bytes, 8,760,495 objects.

It seems a has consumed 74 MB, about 8 bytes per cell. That makes
sense, because a cons cell consists of two pointers (car, cdr) and a
memory address is 32 bits.
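For the Python side of that comparison, sys.getsizeof gives a rough per-slot figure for the list itself: it counts the pointer array but not the four shared string objects. A sketch, assuming CPython (the numbers in the comment are ballpark, not measurements from the thread):

```python
import sys

a = []
for _ in range(1000000):
    a.append('fool')  # 1,000,000 references to one string object

per_slot = sys.getsizeof(a) / float(len(a))
# Roughly one pointer per slot plus over-allocation slack:
# ~4-5 bytes on a 32-bit build, ~8-9 bytes on 64-bit.
print(per_slot)
```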
 
