Sorting records using sort()

M

Martijn Lievaart

Yes, IF there is no 'simple' way to make std::sort() work.

And IF that is the case, then shouldnt there be another version
of std::sort() to do what I want to do. Doesnt it seem natural to
have an interface in c++ that does the same thing that the c
function qsort CAN do, without actually going the "c" way??

I don't think so. Qsort is not type safe, there is no way to do so in C.
Std::sort improves on this, but you don't have a type known compile time.
Your problem is exotic in my eyes, I don't think std::sort should take it
into account.

HTH,
M4
 
P

P.J. Plauger

And IF that is the case, then shouldnt there be another version
of std::sort() to do what I want to do. Doesnt it seem natural to
have an interface in c++ that does the same thing that the c
function qsort CAN do, without actually going the "c" way??

Yes, it's called qsort. Why is it so hard for you to accept that
it's a part of Standard C++, not to mention the best fit to the
problem you describe?

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
F

Francis Glassborow

Elijah said:
Martijn Lievaart said:
On Sun, 04 Jan 2004 04:15:33 +0000, Elijah Bailey wrote:

[ csc++ removed, not relevant for what I want to say ]
First problem to qsort: It's slower than using sort()

Qsort is slower only if std::sort can inline the comparison. In this case
this seems to me that with all the other overhead qsort will actually be
faster (let alone easier).

M4

Yes, IF there is no 'simple' way to make std::sort() work.

And IF that is the case, then shouldnt there be another version
of std::sort() to do what I want to do. Doesnt it seem natural to
have an interface in c++ that does the same thing that the c
function qsort CAN do, without actually going the "c" way??

Do you think that the OP's requirements are particularly common? Rightly
or wrongly, I do not. We have a lot of work to do on the Standard C++
Library without spending time on something such as this. If someone
really needs it they can write it for themselves (std::sort() does not
rely on some special magic) If they think it likely to be of more
general use they can submit their work either to Boost or directly to
the Library Working Group of WG21.

Actually we have a perfectly good interface that does what C's qsort
does, it is called qsort. What you are asking for is something else
which seems to be of very limited utility.
 
K

kanze

(e-mail address removed) (Elijah Bailey) wrote in message
Yes, IF there is no 'simple' way to make std::sort() work.

I've not followed the discussions completely, but if I understand
correctly, you want to invoke *one* sort function for different types of
data. (This isn't the way you present it, but it comes out to the same
thing. You have data in raw memory, represented by a char[]; sometimes,
it is one type, with one size, and sometimes it is another type, with
another size.)

This is not the way std::sort works, or for that matter, not the way the
standard library in general works. The whole point of templating is
that you have a separate instance of the function for each type of data.
And IF that is the case, then shouldnt there be another version of
std::sort() to do what I want to do. Doesnt it seem natural to have
an interface in c++ that does the same thing that the c function qsort
CAN do, without actually going the "c" way??

Well, the interface in C++ would be pretty close to the interface in C
anyway -- about the only difference might be to allow a functional
object to be used instead of only a pointer to a function.
 
E

Elijah Bailey

P.J. Plauger said:
Yes, it's called qsort. Why is it so hard for you to accept that
it's a part of Standard C++, not to mention the best fit to the
problem you describe?

Here's the real problem that I need it for (Which incidentally I never
mentioned)
Consider a database of fixed record size (Because doing variable record size
is already pain, even if you had everything in memory). Now if you had
a "STL" way of doing this, then you could
run all "STL" functions on this database compared to reading the file, doing
the same thing and writing it. (Is that faster? Yes, indeed it is. Benchmark
an mmap compared to fopen/read on ur machine, and u wud hopefully agree)

Yes, its true that maybe STL was only written for In-Memory data,
and STL implementations did not care about out of memory data.
But I believe that both for inmemory and outmemory data, now or
later, locality will be imporatant. And if STL takes into account that
fact, then not only could we use it for data in memory, but also on
data that resides on disk.

I dont want the Standard people plagued with the burden of doing it
for just my case. But I believe that handling fixed size record files is
fairly common, if not as common as using sort() for instance...

And anyway, I have an implementation right now which uses qsort
to do it. I just was amazed that it was so tough to modify sort() to
do the same thing...

Thanks for your time and comments,
--Elijah
 
K

kanze

(e-mail address removed) (Elijah Bailey) wrote in message
Here's the real problem that I need it for (Which incidentally I never
mentioned) Consider a database of fixed record size (Because doing
variable record size is already pain, even if you had everything in
memory). Now if you had a "STL" way of doing this, then you could run
all "STL" functions on this database compared to reading the file,
doing the same thing and writing it. (Is that faster? Yes, indeed it
is. Benchmark an mmap compared to fopen/read on ur machine, and u wud
hopefully agree)

Do you know the structure of this data at compile time? If so, there is
no problem creating the necessary PODs and iterators, and mapping them
to the data. Theoretically, you can't guarantee the lack of padding,
but it is a technique I've used once or twice, it generally works, and
if you are using mmap, you aren't very portable anyway.

My understanding was that the size and structure of the elements was not
known until runtime. If this is the case, I really don't see any
general answer -- the STL is based on compile time type resolution.
 
K

kanze

Do you think that the OP's requirements are particularly common?

Yes and no. For the moment, I'm not completely sure what his
requirements are.

There are certainly various reasons why one might want to overlay a data
structure over pure memory. To date, I've had very good success in
doing this, provided that the struct was a POD type, and that all
alignment requirements were met in the actual data set. (Most of the
time, the struct will only contain arrays of char, so there are no
alignment requirements.)

Technically, the standard doesn't make any guarantees, but in practice,
I've had no trouble with doing things like:

struct Record
{
char state ;
char id[ 20 ] ;
char value[ 11 ] ;
// ...
} ;

std::vector< char > buffer ;
buffer.resize( recordCount * sizeof( Record ) ) ;
Record* record = reinterpret_cast< Record* >( &buffer[ 0 ] ) ;

// ...

char* buffer = mmap( ... ) ;
Record* record = reinterpret_cast< Record* >( buffer ) ;

etc. Obviously, if this works, it is trivial to create an STL
compatible iterator for the buffer, and use it with the STL algorithms.

As I said, this is useful, and it is not guaranteed by the standard. On
the other hand, I'm not convinced that it is worth the effort that might
be necessary -- while the intent seems rather straightforward, I'd hate
to have to come up with the words to express it rigorously without
limiting implementations too much otherwise.
 
J

Jeff Schwab

Elijah said:
I want to sort a set of records using STL's sort() function,
but dont see an easy way to do it.

I have a

char *data;


which has size mn bytes where m is size of the record and
n is the number of records. Both these numbers are known
only dynamically. I have a function less_than that can compare
two records of size m given the pointers to the two records.

Is there an easy way to call STL sort() on this data and sort it.
The data is big and I do NOT want to allocate a list of pointers
of size n or anything linear in size. Assume that except the data,
we do not have much space...

I thought of tricking sort() using a dummy Record class that is
templated using the size of the record...But since m can change
dynamically this doesnt work.


Thanks in advance for your comments,
--Elijah


http://home.comcast.net/~jeffrey.schwab/code/records/
 
S

Stephen Howe

Here's the real problem that I need it for (Which incidentally I never
mentioned)

You should have mentioned it. Nobody likes putting sweat and effort into
solving artificial problems and later being confronted with the "real"
problem.
I dont want the Standard people plagued with the burden of doing it
for just my case. But I believe that handling fixed size record files is
fairly common, if not as common as using sort() for instance...

And anyway, I have an implementation right now which uses qsort
to do it. I just was amazed that it was so tough to modify sort() to
do the same thing...

It is tough. sort() is templatised which means it _has_ to determine the
size of its elements at compile-time rather than run-time.
And there is no way that is likely to ever change.
Anyone who proposes adding additional dynamic run-time ideas to C++ will
normally be shot down.
Anyone who proposes adding additional static compile-time ideas to C++ will
be welcome with open arms.

To get sort() working, one of your constraints has to go.

I have sort() working with arbitrary size elements. But the price I paid was
in your original message as I have a vector of pointers to the elements. And
in the enviroment I am in, pointers are 4-bytes, the records are
considerably larger, sorting the pointers seems a good option. But with the
messages posted here, I am wondering whether I would have been better off
with qsort(). The code would be simpler. If the element size is 4 or less
bytes, I can make the additional optimisation of not using a vectors of
pointer at all but using qsort() directly on the records.

Yes, its true that maybe STL was only written for In-Memory data,
and STL implementations did not care about out of memory data.
But I believe that both for inmemory and outmemory data, now or
later, locality will be imporatant. And if STL takes into account that
fact, then not only could we use it for data in memory, but also on
data that resides on disk.

If the idea is to sort more data than fits in memory, it has been done. In
such a situation, you read as much as possible in chunks, sort, and write to
temporary files.
Later you perform an n-way merge of the temporary files, writing the final
result o some output file. All in Knuth.

Stephen Howe
 
F

Francis Glassborow

Stephen said:
If the idea is to sort more data than fits in memory, it has been done. In
such a situation, you read as much as possible in chunks, sort, and write to
temporary files.
Later you perform an n-way merge of the temporary files, writing the final
result o some output file. All in Knuth.

Even here index files are often used and when the data has been
completely processed the data file is re-written in sorted order. In
many systems writing is much more expensive than reading and so we gain
by keeping it to a minimum.

Analysing performance on systems with mixed memory speeds (all the way
from level 1 cache through to disc or even tape) does not respond well
to big 'O' formulae. Worse, it is very sensitive to the resources
currently available (I can still remember the problem I had that
resulted from a client having used a single directory on a MSDOS machine
for all his word-processing files for three years -- he had never
deleted anything and he had set the application to keep two back-ups. It
took six hours just to remove the second back-ups, and almost 24-hours
for a batch file to reorganise the directory into 26 sub-directories on
the basis of the first letter of the file names)

Even though all your memory-devices may be random access the difference
in speed may still favour treating the slowest as if it were a serial
access device.
 
J

Jeff Schwab

Elijah said:
I tried compiling it with g++ 2.96 and 3.1...both dont compile the code...
Thanks for the code thou...wud be nice if I could test it on my linux box.
--Elijah

What's the error? It works fine on Mandrake with gcc 3.3.
 
H

Hyman Rosen

Stephen said:
It is tough. sort() is templatised which means it _has_ to determine the
size of its elements at compile-time rather than run-time.

std::sort is templatized on an iterator type. "All" you need to do is
provide a random access iterator that knows how to deal with the array
of bytes in question. The value type is going to need to be a proxy
class for dealing with that underlying memory. And that size of an
element controlled by such an iterator certainly does not have to be
part of its type.
 
F

Francis Glassborow

Hyman Rosen said:
std::sort is templatized on an iterator type. "All" you need to do is
provide a random access iterator that knows how to deal with the array
of bytes in question. The value type is going to need to be a proxy
class for dealing with that underlying memory. And that size of an
element controlled by such an iterator certainly does not have to be
part of its type.

And the real benefit from using this approach is that it allows you to
use many other algorithms (and in the context of sorting it allows you
to use stable_sort and partial_sort).

I doubt that this kind of iterator is worth it for the Standard Library,
but if you use flat (probably fixed record width) databases and mmap it
could be worth the effort. However if you are concerned with performance
(in terms of speed) I suspect you would be better off looking at the
specialised sorts for data held in serial access storage.
 
J

Jeff Schwab

Martijn said:
Very nice! Shows the power of the STL with user defined iterators.

The STL has been very good to me.
So in the end, it was not that difficult to do. Nobody really commented if
this would be 100% complient, but I doubt it isn't. I retract my earlier
comment on it being a difficult technique, even though I described what
you coded.

It should be compliant, but it ain't standard-library quality. :) At
any rate, there's a lot of potential for misuse. I think the fact that
you and I had about the same idea of how this should be done does
suggest that it's a reasonable approach, though.
 
M

Martijn Lievaart

It should be compliant, but it ain't standard-library quality. :) At
any rate, there's a lot of potential for misuse. I think the fact that
you and I had about the same idea of how this should be done does
suggest that it's a reasonable approach, though.

There has been a lot of talk about iterators and proxy objects in the
past. std::vector<bool> showed that it is impossible to write a
STL-conforming container that uses proxy objects. Unfortunately, I'm a
little unclear on the details and my Meyers books seem to have gone of to
outer space somewhere.

Now what was it again why vector<bool> is not compliant? Does it apply to
this situation? Anyone knows?

M4
 
A

Allan W

(e-mail address removed) (Elijah Bailey) wrote
I want to sort a set of records using STL's sort() function,
but dont see an easy way to do it.

I really do think that qsort() is the way to go.
However, here's what you asked for.

First, you need a strucure to sort.

// Structure needed for sort.
template <int size>struct Record {
char dat[size];
bool operator<(const Record<size>&rhs) const
{ return less_than(dat,rhs.dat); }
};

On some systems (or with some compiler settings!), this might only work
correctly if size is a multiple of some padding value. I leave you to
figure out how to deal with this.

The sort itself is trivially easy.

// The sort itself.
template<int size>void dosort() {
Record<size> *dat = reinterpret_cast<Record<size>*>(data);
std::sort(dat, dat+n);
}

Unfortunately, you can't just call
dosort<m>
because m isn't a compile-time const. This makes sense if you think
about it -- your comments about speed are correct, but what you're doing
is trading speed against code size. The code is expanded in-line, which
means it needs to understand the recordsize before you begin. So, in
order to call it you're going to need a constant. Here's the only way I
can think of to convert a variable into a compile-time const.

// The only way to call it!
void callsort() {
switch (m) {
case 4: dosort< 4>(); break;
case 8: dosort< 8>(); break;
case 12: dosort<12>(); break;
case 16: dosort<16>(); break;
case 20: dosort<20>(); break;
// ... etc...
default: throw "Missing case!";
}
}

You're going to need a case for every possible value of m.

If you test the above code with Microsoft Visual C++ 6.0, you're going to
get an error. This is caused by a compiler bug; if the only place we
instantiate Record<size> is within dosort<size>, the compiler gets it
very wrong (in this case, it will silently always use Record<20>). The
solution is to instantiate Record<size> in callsort, and pass it to
dosort:

template<int size>void dosort(Record<size>unused) {
Record<size> *dat = reinterpret_cast<Record<size>*>(data);
std::sort(dat, dat+n);
}

// The only way to call it!
void callsort() {
switch (m) {
case 4: dosort(Record< 4>()); break;
case 8: dosort(Record< 8>()); break;
case 12: dosort(Record<12>()); break;
case 16: dosort(Record<16>()); break;
case 20: dosort(Record<20>()); break;
// ... etc...
default: throw "Missing case!";
}
}

By explicitly instantiating every Record<n> size, we bypass the Microsoft
compiler bug. Note: I only tested this with values that were a multiple of
4. I don't know if it would work for odd lengths or not (this would be
compiler-dependant anyway).

I think that by now you can see the problem with this approach: code bloat.
(Not talking about source code, but generated code!) All of the in-line
logic to do the sort is being instantiated 5 times. If m can have 200
possible values, the problem is 40 times worse. If m can have 5000 possible
values, you may find that your program is MUCH bigger than the database
you're trying to sort. You mentioned that you were low on space. Even if
you do have enough RAM for this, depending on how your system works,
loading all of this code into memory might actually make your program
seem sluggish.

Have you considered the many virtues of qsort()? It *is* part of C++, and
the specs match your situation so closely, it's as if they custom wrote it
for you.
 
M

Mike Bland

(e-mail address removed) (Elijah Bailey) wrote in message


Do you know the structure of this data at compile time? If so, there is
no problem creating the necessary PODs and iterators, and mapping them
to the data. Theoretically, you can't guarantee the lack of padding,
but it is a technique I've used once or twice, it generally works, and
if you are using mmap, you aren't very portable anyway.

I know I'm coming in a late here, but I managed to implement something
which sounds somewhat similar to the problem being described:

- First, I implemented a mmapped file array class after reading an
article by Matt Austern on the subject (the link I have for it,
http://www.cuj.com/experts/1907/austern.htm?topic=experts, is now
broken, apparently), rendering the mmapped implementation at least
somewhat portable by hiding it behind an interface.

- Then I created a Table class to parse out the fields and records
from a particular flat-file relational database table format which
uses the mmapped file under the hood.

- Then I created a templated Iterator class that uses an adapter
function to convert each record in a Table to the desired view, as it
were.

- Voila, now I can use mmapped files with the STL.

Granted, I'm using mostly lower_bound() and equal_range(), as the data
I deal with is read-only; but I imagine crafting an iterator to write
to fixed-length records to use with sort() and other mutating
algorithms wouldn't be such a big leap.

And, incidentally, I wonder if my Iterator doesn't just replicate
functionality in Boost, but I've yet to play with Boost that much yet
(though I've been reading up on it) and I doubt it would work in our
baseline, as we're bound to use Sun CC 5.3 for some time.

Hope this helps,

Mike
 
M

Martijn Lievaart

Granted, I'm using mostly lower_bound() and equal_range(), as the data
I deal with is read-only; but I imagine crafting an iterator to write
to fixed-length records to use with sort() and other mutating
algorithms wouldn't be such a big leap.

Remember that std::sort stores additional copies, so you'll have to create
proxy objects. Someone posted an implementation, but I cannot find it
right now (this thread has jumped over multiple groups, so it might be in
acllc-c++ or clc++m).

HTH,
M4
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,780
Messages
2,569,611
Members
45,276
Latest member
Sawatmakal

Latest Threads

Top