remove_if with a mask

J

James Kanze

On 6/8/2010 9:13 AM, eLVa wrote:
Post what you have, post what didn't work, post your
explanation on how you expected it to work. Otherwise this
looks too much like homework, and we don't do homework.
The solution involves a predicate that should keep the second
iterator (and increment it every time the operator() is
called). The predicate can probably be const, but has to keep
non-const reference to the mask iterator. Dereference the
mask iterator to check and let the op() of your predicate
essentially return that dereferenced bit.

I don't think predicates are allowed to have mutable state. And
I don't see any requirement in the standard that remove_if visit
the elements in order. Perhaps on a multicore machine, the
library might figure out some way of parallelizing the
operations, using a distinct copy of the predicate on each core,
and each core only visiting some of the elements.

That's in theory, at least.
 
J

James Kanze

On 6/8/2010 11:09 AM, eLVa wrote:

[...]
Storing states *is* what predicates implemented as objects are
for. If you didn't need to store any state, you would use
a function as your predicate (possibly a *lambda* function).

But not mutable state, since as Leigh points out, iterators can
be copied at will (and in most implementations of remove_if,
will be---using find_if at the begining is a common
optimization).

You can probably get away with using some sort of indirection to
access the state, but formally, the standard doesn't guarantee
anything about the order remove_if visits the objects. Off
hand, I can't think of any fancy tricks which would e.g. allow
parallelization, but you can't exclude that possibility.

More generally, I find that when you're "iterating" in parallel
over several containers, if the containers support indexing,
using a single index is more intuitive. If you only have one
variable, you're guaranteed to never get out of sync.
 
V

Victor Bazarov

I don't think predicates are allowed to have mutable state. And
I don't see any requirement in the standard that remove_if visit
the elements in order. Perhaps on a multicore machine, the
library might figure out some way of parallelizing the
operations, using a distinct copy of the predicate on each core,
and each core only visiting some of the elements.

That's in theory, at least.

Hmm. I doubt it. The order of visiting is predetermined. The two
first arguments to 'remove_if' are of the kind 'ForwardIterator'. How
do you parallelize access to the sequence using ForwardIterators? They
aren't random-access ones... But then again, what do I know?

V
 
K

Kai-Uwe Bux

Victor said:
Hmm. I doubt it. The order of visiting is predetermined. The two
first arguments to 'remove_if' are of the kind 'ForwardIterator'. How
do you parallelize access to the sequence using ForwardIterators? They
aren't random-access ones... But then again, what do I know?

I don't think the ForwardIterator requirement can be taken very seriously.
All that says is that the library has to provide a generic implementation
that works for forward iterators. However, the library may also contain
overloads (maybe even found by compiler magic not available to mortals) for
other iterators. In particular, std::deque<T,A>::iterator comes to mind,
where a specialized algorithm could take advantage of the clustering of a
deque into pages. One could imagine that an implementation takes advantage
of this and employs parallelism.

As long as the standard does not specify the order of predicate evaluations
in the effects clause, I don't see how we can rely on any particular order
(at least if the iterators passed to remove_if() _happen to be_ random
access iterators, which an implementation could detect).


Best

Kai-Uwe Bux
 
K

Kai-Uwe Bux

Leigh said:
I think it depends on the algorithm: it makes sense for std::for_each and
std::generate to support functors with mutable state but I guess these are
not classed as predicates so hmmm. It would be nice if the standard was
explicit about this, shame it isn't (AFAIK).

I think, the wording of for_each() was meant to allow for stateful function
objects: (a) it specifies the order in which items are visited, (b) it says
f (the predicate) is applied (as opposed to a possible copy) and (c) it
specifies that f be returned. I agree that the standard could be much more
explicit about this (and I would prefer it that way).

And with regard to generate(), I don't see how the generality of a functor
would be of any use if it isn't allowed to have a state.
 
K

Kai-Uwe Bux

Kai-Uwe Bux said:
I think, the wording of for_each() was meant to allow for stateful
function objects: (a) it specifies the order in which items are visited,
(b) it says f (the predicate) is applied (as opposed to a possible copy)
and (c) it specifies that f be returned. I agree that the standard could
be much more explicit about this (and I would prefer it that way).

And with regard to generate(), I don't see how the generality of a functor
would be of any use if it isn't allowed to have a state.

Oops, the keycodes of my news reader tricked me.

With regard to the term "predicate" the standard says [25/7]:

The Predicate parameter is used whenever an algorithm expects a function
object that when applied to the result of dereferencing the corresponding
iterator returns a value testable as true. In other words, if an algorithm
takes Predicate pred as its argument and first as its iterator argument,
it should work correctly in the construct if (pred(*first)){...}. The
function object pred shall not apply any non-constant function through the
dereferenced iterator. This function object may be a pointer to function,
or an object of a type with an appropriate function call operator.

I don't see that a state is ruled out. Forbidden is the application of non-
const functions to dereferenced items. However, in most places where a
predicate is used, the effects clause makes so little guarantees that a
stateful predicate doesn't pay.


Best

Kai-Uwe Bux
 
P

Paul Bibbings

Victor Bazarov said:
[..]
I threw together something similar to what you have and it *doesn't*
appear to be working.

16:15:35 Paul Bibbings@JIJOU
/cygdrive/d/CPPProjects/CLCPP $cat vector_mask.cpp
// file: vector_mask.cpp

#include<vector>
#include<algorithm>
#include<iostream>

template<typename T>
class MaskFunctor {
public:
explicit MaskFunctor(std::vector<bool>& mask)
: mask_(mask)
, mask_iter_(mask_.begin())
{ }
bool operator()(const T&)
{
if (mask_iter_ == mask_.end())
return false;
else
return *mask_iter_++;
}
private:
std::vector<bool>& mask_;
std::vector<bool>::const_iterator mask_iter_;
};

int main()
{
std::vector<bool> mask;
for (int i = 0; i< 4; ++i)
{
mask.push_back(true);
mask.push_back(false);
}

std::vector<int> i_vec;
for (int i = 0; i< 8; ++i)
i_vec.push_back(i);

std::vector<int>::iterator end =
std::remove_if(i_vec.begin(), i_vec.end(),
MaskFunctor<int>(mask));

i_vec.erase(end, i_vec.end());

for (std::vector<int>::const_iterator ci = i_vec.begin();
ci != i_vec.end();
++ci)
{
std::cout<< *ci<< ' ';
}
std::cout<< '\n';
}

16:15:41 Paul Bibbings@JIJOU
/cygdrive/d/CPPProjects/CLCPP $./vector_mask
2 4 6

The implementation makes a copy of the predicate. It's important that
the predicate keeps the iterator by reference. Try rewriting it with
that in mind.

Indeed. Having MaskFunctor store a pair of references to iterators -
one being moveable from mask.begin() towards the other storing
mask.end(), for the check in op() - solves the problem with the above
code, producing:

22:43:52 Paul Bibbings@JIJOU
/cygdrive/d/CPPProjects/CLCPP $./vector_mask
1 3 5 7

Regards

Paul Bibbings
 
E

eLVa

I don't think predicates are allowed to have mutable state.  And
I don't see any requirement in the standard that remove_if visit
the elements in order.  Perhaps on a multicore machine, the
library might figure out some way of parallelizing the
operations, using a distinct copy of the predicate on each core,
and each core only visiting some of the elements.

That's in theory, at least.

Ok so what would be the best solution for this kind of use ?
Avoid remove_if altogether ??
I'm thinking of implementing it as a simple loop over the container
with an iterator, calling erase when the mask says it is necessary, it
would be simple, yet not subject to any kind of perversion related to
optimization in standard algorithms.

As I'm not an expert in C++, what would be the best way to go for this
problem ? In general, I have several containers each containing the
same number of data. One of them is used to generate a mask of
booleans for items that match some pattern. Finally all of the other
containers need to be filtered the same way I did for the first.
I thought this way was the better, but clearly it is not easy to
implement safely with the standard library, so am I going the wrong
way ??
 
Ö

Öö Tiib

Ok so what would be the best solution for this kind of use ?
Avoid remove_if altogether ??
I'm thinking of implementing it as a simple loop over the container
with an iterator, calling erase when the mask says it is necessary, it
would be simple, yet not subject to any kind of perversion related to
optimization in standard algorithms.

I would go with loops. Predicates with side effects are forbidden and
predicates with memory are confusing.
As I'm not an expert in C++, what would be the best way to go for this
problem ? In general, I have several containers each containing the
same number of data. One of them is used to generate a mask of
booleans for items that match some pattern. Finally all of the other
containers need to be filtered the same way I did for the first.
I thought this way was the better, but clearly it is not easy to
implement safely with the standard library, so am I going the wrong
way ??

Can you explain why you have several containers with exactly same
size?

On most of cases it is worth to have a class that combines all the
elements of these different containers (that you synchronize manually)
into one object. Result is one container with all data and no issues
synchronizing several containers.
 
E

eLVa

I would go with loops. Predicates with side effects are forbidden and
predicates with memory are confusing.


Can you explain why you have several containers with exactly same
size?


Well, in my field, you start with a bunch of 2d points. Then you do
some calculations and you end up with different vector of 2d or 3d
points (exactly the same number though). Next step, you filter those
base on same predicates. There, you have two choices, keeping a list
of indexes giving you the positions you're interested in ; or, coming
from the C world, you prefer to compact the representation, and delete
those you don't want and ensuring the following algorithms the
representation is continuous.
By the way, I say vector, but most of the time, I chose a list for it
is faster to remove elements inside. I try to use iterators as
parameters for my functions, rather than a specific type of
containers, even though I'm sure in this case, I could hard-code those
functions for lists as it will (normally) always be faster to delete
elements from.

Anyway, do you think I should go with an object representation ? I
would still have to think of whether I compact the representation, or
provide some kind of iterators functions that deal with gaps created
by my mask ??
 
J

James Kanze

On 6/8/2010 2:16 PM, James Kanze wrote:

[...]
Hmm. I doubt it. The order of visiting is predetermined.
The two first arguments to 'remove_if' are of the kind
'ForwardIterator'. How do you parallelize access to the
sequence using ForwardIterators? They aren't random-access
ones...

The standard allows functions to use different algorithms when
the iterators are of a more strict type. (There are even cases
where it requires it.) If you only provide ForwardIterators,
it's a lot less likely that the implementation will be able to
do anything special, but even then, who knows? Perhaps someone
will invent something fancy which will still work.
 
Ö

Öö Tiib

Well, in my field, you start with a bunch of 2d points. Then you do
some calculations and you end up with different vector of 2d or 3d
points (exactly the same number though). Next step, you filter those
base on same predicates. There, you have two choices, keeping a list
of indexes giving you the positions you're interested in ; or, coming
from the C world, you prefer to compact the representation, and delete
those you don't want and ensuring the following algorithms the
representation is continuous.

Yes but why there is always exactly same number and why you filter
them together? Also from where you get the mask that has exactly same
size too?
By the way, I say vector, but most of the time, I chose a list for it
is faster to remove elements inside. I try to use iterators as
parameters for my functions, rather than a specific type of
containers, even though I'm sure in this case, I could hard-code those
functions for lists as it will (normally) always be faster to delete
elements from.

Vector is fine. Sometimes it is hard to predict without profiling
integrated into real product with real data what container type is
best.
Anyway, do you think I should go with an object representation ? I
would still have to think of whether I compact the representation, or
provide some kind of iterators functions that deal with gaps created
by my mask ??

From performance standpoint it depends if you use the data in same
position of different containers together for something or not. Data
that is (most of the time) used together should be close together,
preferably in same struct. If you use the containers separately and
only sometimes synchronize then it is better to keep them separate.

From object-oriented standpoint i do not know what the points there
mean so i do not know if the set of things at same position in
different containers makes sense as some object-oriented abstraction
of something or not itself (often it does).

If the points stay in different containers then i would use just
simple loop that iterates over all the conteiners at same time and
throw unneeded points out from them at once when filtering.
 
K

Keith H Duggar

Ok so what would be the best solution for this kind of use ?
Avoid remove_if altogether ??
I'm thinking of implementing it as a simple loop over the container
with an iterator, calling erase when the mask says it is necessary, it
would be simple, yet not subject to any kind of perversion related to
optimization in standard algorithms.

As I'm not an expert in C++, what would be the best way to go for this
problem ? In general, I have several containers each containing the
same number of data. One of them is used to generate a mask of
booleans for items that match some pattern. Finally all of the other
containers need to be filtered the same way I did for the first.
I thought this way was the better, but clearly it is not easy to
implement safely with the standard library, so am I going the wrong
way ??

Dude, didn't you see my post? I've already given you THE solution
for these kinds of problems: Boost zip_iterator. No need to screw
around with hacked predicates; and zip_iterator is a very general
idea applicable to many needs.

KHD
 
E

eLVa

Dude, didn't you see my post? I've already given you THE solution
for these kinds of problems: Boost zip_iterator. No need to screw
around with hacked predicates; and zip_iterator is a very general
idea applicable to many needs.

Dude, didn't you see the first post in which I said I don't, and
probably won't use Boost.
Yeah, there are still people who don't want to depend on Boost.
Thanks for the solution however :)
 
E

eLVa

Yes but why there is always exactly same number and why you filter
them together?  Also from where you get the mask that has exactly same
size too?

Imagine a vector of 2d points in image. You transform them in vector
3d points based on some computation.
Then you filter the 3d points you're not interested in. Then you go
back to the vector of 2d points and filter the one you removed from
the previous step. Imagine that I have a bunch of those
transformations, you end up with several containers of several 2d and
3d points. Still you only want a subset of all of them. That's the
general idea.
Vector is fine. Sometimes it is hard to predict without profiling
integrated into real product with real data what container type is
best.

Sure. However, isn't it always true that it is faster to remove
objects from a list than a vector ?
Considering, I will probably go for a solution in which I iterate
through the points and remove them as I go, would'nt it always be
faster ??
From performance standpoint it depends if you use the data in same
position of different containers together for something or not. Data
that is (most of the time) used together should be close together,
preferably in same struct. If you use the containers separately and
only sometimes synchronize then it is better to keep them separate.

Yeah sometimes I need one container, sometimes I need another. But
most of the time, I don't need all of them at the same time.
So I don't think a packed representation of all the data in an object
is optimal.
From object-oriented standpoint i do not know what the points there
mean so i do not know if the set of things at same position in
different containers makes sense as some object-oriented abstraction
of something or not itself (often it does).

If the points stay in different containers then i would use just
simple loop that iterates over all the conteiners at same time and
throw unneeded points out from them at once when filtering.

Ok. Last question, is it faster to filter each of them independently,
or all of them together. I think that when you filter each
independently, you don't end up with lot of cache misses fetching data
from each vector, no ?
I can already hear a lot of people complaining that's context
dependent, subject to empirical thoughts, and probably a lot out of my
skills :) Still, I'm here to learn, so I'm open to critics, thoughts
and advices ^^
 
K

Keith H Duggar

Dude, didn't you see the first post in which I said I don't, and
probably won't use Boost. Yeah, there are still people who don't
want to depend on Boost. Thanks for the solution however :)

LOL ... no I actually missed the "no Boost" condition ;-)

Actually though, I am glad I did miss it and went ahead and
posted a boost solution because now this thread serves as a
great example of how much one loses, how much time one will
waste, and how much more difficult and error prone one's
work becomes when one refuses to use boost.

KHD
 
E

eLVa

LOL ... no I actually missed the "no Boost" condition ;-)

Actually though, I am glad I did miss it and went ahead and
posted a boost solution because now this thread serves as a
great example of how much one loses, how much time one will
waste, and how much more difficult and error prone one's
work becomes when one refuses to use boost.

I really don't want this post to degenerate into a fight between boost
addicts and c++ users :)

I know that there are still places where boost can't be used, that's
the case in some labs in my department.
I would be glad to use it though.

I also know that some features of boost have been integrated in the
standard library, and other will probably be added in the future.
Until then, I have to stick with the standard, and produce code
compiling without additional dependencies.

Thank you really for the input. I understand your point of view. I
hope you understand mine :)
Cheers
 
K

Keith H Duggar

I really don't want this post to degenerate into a fight
between boost addicts and c++ users :)

Yes, a false dichotomy between boost "addicts" and "c++ users".
That will certainly help to avoid degeneration ;-)
I know that there are still places where boost can't be
used, that's the case in some labs in my department.
I would be glad to use it though.

I also know that some features of boost have been integrated in the
standard library, and other will probably be added in the future.
Until then, I have to stick with the standard, and produce code
compiling without additional dependencies.

Thank you really for the input. I understand your point of
view. I hope you understand mine :)

I understand your situation and have been if that spot myself.
But that is irrelevant to the /consequences/ of not using boost.
This thread demonstrates those consequences quite well. In other
words I'm not saying should/good/bad/evil crap. I am just saying
it is painful to either refuse or not be allowed to utilize one
of the best tools we have as "c++ users": Boost.

Anyhow, if you cannot use Boost then you can code a zip_iterator
(modeled after boost) yourself. That is the right abstraction for
this problem. Not predicate hacking.

Do you agree that a zip_iterator (whether in-house or otherwise)
is a better and safer abstraction for handling this problem than
hacked predicates?

KHD
 
Ö

Öö Tiib

Imagine a vector of 2d points in image. You transform them in vector
3d points based on some computation.
Then you filter the 3d points you're not interested in. Then you go
back to the vector of 2d points and filter the one you removed from
the previous step. Imagine that I have a bunch of those
transformations, you end up with several containers of several 2d and
3d points. Still you only want a subset of all of them. That's the
general idea.


So it sounds you have pairs and if the both points were packed into
same struct (lets say "transformation") then when you kick a
transformation (because of non-pleasant 3D point properties) out then
you get 2D point kicked out automatically as well. So you do not need
to construct a filter based on 3D points that you kick out and then
apply that filter to 2D points? Or does the filter depend on other
pairs?
Sure. However, isn't it always true that it is faster to remove
objects from a list than a vector ?

Depends on size. Removal from list involves deallocation. The thing is
i do not even care about it when i do not have working prototype with
real field data examples so i can measure and profile performance. The
things are simple to play around and switch once i have information.
Considering, I will probably go for a solution in which I iterate
through the points and remove them as I go, would'nt it always be
faster ??

If the filter means that major part of container will be thrown away
with single run then i would expect copying the remainder to other
vector and swapping to be faster than modifying a list. That does not
matter without context. Further, when it matters, then there is a
boost::intrusive::list that is *really* fast. OTOH it is more complex
to use than vector and you said that you do not want to use boost.
Yeah sometimes I need one container, sometimes I need another. But
most of the time, I don't need all of them at the same time.
So I don't think a packed representation of all the data in an object
is optimal.

OK. I am still confused, but if you say so.
Ok. Last question, is it faster to filter each of them independently,
or all of them together. I think that when you filter each
independently, you don't end up with lot of cache misses fetching data
from each vector, no ?

If you visit each one once anyway then cache probably makes no
difference. KHD suggested using boost zip_iterator for efficiency. I
would not care unless i had profile that shows that i should. When i
have reasons then i summon a hero who helps me down to avoiding
latency of L1 cache line splits, when i do not have such reasons then
i ignore it all entirely.
I can already hear a lot of people complaining that's context
dependent, subject to empirical thoughts, and probably a lot out of my
skills :) Still, I'm here to learn, so I'm open to critics, thoughts
and advices ^^

You just imagine that what you write is *soft*ware. Avoid mental image
of carving it into rock. Be prepared to throw parts of it away and
write anew when needed. You lack reasons to do it one way or other
now. So you should care about expressive interface first. Simple to
read code is also simpler to modify. Only geniuses can write things
that are near perfect on first attempt.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,777
Messages
2,569,604
Members
45,222
Latest member
patricajohnson51

Latest Threads

Top