how to do a "REAL" string compare?

N

Nobody

Heres the deal... I have an application where I have a list (as in a Windows
list control, but thats not important) displayed to the user. I sort this
list based on the list controls sort function (again, its not important that
its Windows) which ends up calling a compare function in my code:

int CompareFunc(char* str1, char* str2)
{
}

this function returns -1, 0 or 1 which gets passed on to the internal quick
sort algorithm. No problem, it all works fine.

Now I have a user in which this list displays "multi-part" items. You can
guess where this is headed :), the list ends up like this:

Item (1/100)
Item (11/100)
Item (2/100)

Now while that is a "correct" string sort, its kind of lame. I could force
the user to zero-pad or zero-pad myself, but both seem kind of hokey as I am
either putting requirements on the user or changing his item text. I'd much
rather end up with:

Item (1/100)
Item (2/100)
..
..
..
Item (11/100)

As it should. Now keep in mind that this could end up in dozens of formats,
brackets, parents, dashes, asterisks, etc or any endless supply of cutesy
characters a user might enter. Even the forward slash may not be the part
separator and there may be stuff after the part #s.

I've seen some applications do this in the past, but never saw the source
for them. How can this be sorted properly without requiring the user to
enter it in a very specific format? I could never handle every possible
format in my code. There must be some kind of cool generic way to do this.
 
I

Ioannis Vranos

Nobody said:
Heres the deal... I have an application where I have a list (as in a Windows
list control, but thats not important) displayed to the user. I sort this
list based on the list controls sort function (again, its not important that
its Windows) which ends up calling a compare function in my code:

int CompareFunc(char* str1, char* str2)
{
}

this function returns -1, 0 or 1 which gets passed on to the internal quick
sort algorithm. No problem, it all works fine.

Now I have a user in which this list displays "multi-part" items. You can
guess where this is headed :), the list ends up like this:

Item (1/100)
Item (11/100)
Item (2/100)

Now while that is a "correct" string sort, its kind of lame. I could force
the user to zero-pad or zero-pad myself, but both seem kind of hokey as I am
either putting requirements on the user or changing his item text. I'd much
rather end up with:

Item (1/100)
Item (2/100)
.
.
.
Item (11/100)

As it should. Now keep in mind that this could end up in dozens of formats,
brackets, parents, dashes, asterisks, etc or any endless supply of cutesy
characters a user might enter. Even the forward slash may not be the part
separator and there may be stuff after the part #s.

I've seen some applications do this in the past, but never saw the source
for them. How can this be sorted properly without requiring the user to
enter it in a very specific format? I could never handle every possible
format in my code. There must be some kind of cool generic way to do this.

Prefixing single digits with 0 is a fix (I assume this is what you mean
by "zero-pad").

Also prefixing with space does the same job:


#include <string>
#include <algorithm>
#include <iostream>
#include <list>

int main()
{
using namespace std;

list<string> somelist;

somelist.push_back("Item ( 2/100");
somelist.push_back("Item ( 1/100)");
somelist.push_back("Item (11/100)");

somelist.sort();

for(list<string>::const_iterator p= somelist.begin();
p!=somelist.end(); ++p)
cout<<*p<<"\n";
}


C:\c>temp
Item ( 1/100)
Item ( 2/100
Item (11/100)

C:\c>


In summary just make sure all strings share the same length.


I am interested in a cleaner approach myself too.
 
N

Nobody

Ioannis Vranos said:
Prefixing single digits with 0 is a fix (I assume this is what you mean by
"zero-pad").

Also prefixing with space does the same job:


#include <string>
#include <algorithm>
#include <iostream>
#include <list>

int main()
{
using namespace std;

list<string> somelist;

somelist.push_back("Item ( 2/100");
somelist.push_back("Item ( 1/100)");
somelist.push_back("Item (11/100)");

somelist.sort();

for(list<string>::const_iterator p= somelist.begin();
p!=somelist.end(); ++p)
cout<<*p<<"\n";
}


C:\c>temp
Item ( 1/100)
Item ( 2/100
Item (11/100)

C:\c>


In summary just make sure all strings share the same length.


I am interested in a cleaner approach myself too.

Well, as I said, I don't want to change the text the user entered. Space
padding is just as bad as zero padding :). Just finding the part number in
the string could be tough since you can have:

Item #1 of 100
Item 1/100
Item (1/100)
Item 1-100
Item 1 / 100
Item 1/ 100
Item *1/100*
Item <1/100>
Item [1/100]

etc.

You could never handle every case... and what if someone wants to be cute
and do something like:

Item [1/100] -= by Fred =-

or

Item Part # 1-100 [1/100]
Item Part # 1-100 [2/100]

etc.

Some users would follow format requirements, but I guess a huge majority
wouldn't. And a lot of cutesy users would rather put the cutesy crap on the
description then have it properly sorted.
 
J

Jonathan Turkanis

Nobody said:
Now I have a user in which this list displays "multi-part" items.
... the list ends up like this:

Item (1/100)
Item (11/100)
Item (2/100)

Now while that is a "correct" string sort, its kind of lame. I could
force the user to zero-pad or zero-pad myself, but both seem kind of
hokey as I am either putting requirements on the user or changing his
item text. I'd much rather end up with:

Item (1/100)
Item (2/100)
.
.
.
Item (11/100)

As it should.
source for them. How can this be sorted properly without requiring
the user to enter it in a very specific format? I could never handle
every possible format in my code. There must be some kind of cool
generic way to do this.

There are infinitely many criteria for sorting strings. There's no way to answer
your question without getting more of an idea of the range of possible inputs,
what they represent, and why you think various algorithms are lame.

One thing you might do is count any non-alphanumeric character as a separator,
and then lexicographically sort the resulting sequences with indivual components
ordered numrically. OTOH, maybe this, too, is lame.

Jonathan
 
I

Ivan Vecerina

Nobody said:
Heres the deal... I have an application where I have a list (as in a
Windows list control, but thats not important) displayed to the user. I
sort this list based on the list controls sort function (again, its not
important that its Windows) which ends up calling a compare function in my
code:

int CompareFunc(char* str1, char* str2)
{
}

this function returns -1, 0 or 1 which gets passed on to the internal
quick sort algorithm. No problem, it all works fine. ....
I'd much rather end up with:

Item (1/100)
Item (2/100)
.
Item (11/100) ....
I've seen some applications do this in the past, but never saw the source
for them. How can this be sorted properly without requiring the user to
enter it in a very specific format? I could never handle every possible
format in my code. There must be some kind of cool generic way to do this.

What about something like:

int CompareFunc(char const* s1, char const* s2)
{
for(;;) {
unsigned char c1 = *s1++;
unsigned char c2 = *s2++;
if( isdigit(c1) && isdigit(c2) ) {
unsigned long v1 = strtoul(s1-1,&s1,10);
unsigned long v2 = strtoul(s2-1,&s2,10);
if(v1!=v2) return (v1<v2) ? -1 : +1;
continue;
}
else if(c1!=c2) {
...compare single chars as usual...
}
else if( !c1 )
return 0; // reached end of strings
}
}



I hope this helps,
Ivan
 
N

Nobody

What about something like:

int CompareFunc(char const* s1, char const* s2)
{
for(;;) {
unsigned char c1 = *s1++;
unsigned char c2 = *s2++;
if( isdigit(c1) && isdigit(c2) ) {
unsigned long v1 = strtoul(s1-1,&s1,10);
unsigned long v2 = strtoul(s2-1,&s2,10);
if(v1!=v2) return (v1<v2) ? -1 : +1;
continue;
}
else if(c1!=c2) {
...compare single chars as usual...
}
else if( !c1 )
return 0; // reached end of strings
}
}

This would only work if *any* number was assumed to be the part #. If the
description contained any other number, this would fail.
 
N

Nobody

Jonathan Turkanis said:
There are infinitely many criteria for sorting strings. There's no way to
answer
your question without getting more of an idea of the range of possible
inputs,
what they represent, and why you think various algorithms are lame.

One thing you might do is count any non-alphanumeric character as a
separator,
and then lexicographically sort the resulting sequences with indivual
components
ordered numrically. OTOH, maybe this, too, is lame.

Jonathan

LOL, nah, all I said was zero-padding or space-padding is lame because we
dont want to change the original string or force the user to do anything
special (like zero pad the numbers themselves). The description text
represents a file or multiple files that a user uploads. Since these files
can get large, they are often split up into multiple parts.

I guess we could code this for the few most common forms and then add 'em as
we see 'em.
 
I

Ioannis Vranos

Nobody said:
LOL, nah, all I said was zero-padding or space-padding is lame because we
dont want to change the original string or force the user to do anything
special (like zero pad the numbers themselves). The description text
represents a file or multiple files that a user uploads. Since these files
can get large, they are often split up into multiple parts.

I guess we could code this for the few most common forms and then add 'em as
we see 'em.


At first I observed this:

#include <string>
#include <iostream>
#include <list>

int main()
{
using namespace std;

list<string> somelist;

somelist.push_back("Item ( 2/100)");
somelist.push_back("Item ( 1/100)");
somelist.push_back("Item (11/100)");


for(list<string>::const_iterator p= somelist.begin();
p!=somelist.end(); ++p)
{
int sum=0;

for(string::size_type i=0; i<p->size(); ++i)
sum+= p->operator[](i);

cout<<sum<<"\n";
}
}


C:\c>temp
786
785
802

C:\c>


The question is how does it scale.


Anyway let's try to break this into steps. At first, I think such a
problem would arise to strings whose prefix substring is the same.


So the first step would be to check any string up to the first digit (if
existent), and then check the rest strings if they begin in the same way.

Regular expressions can help write minimal code for this, at first you
would check if there is a match for the regular expression "\\d+"
(digit) and then find where in the string a first occurrence of it
exists, and then use the substring up to that digit to check if some
other of the rest strings begin the same way (let's suppose we have the
string ""Item ( 2/100)", we then use the regular expression "^Item (.*"
with ^ denoting the beginning of the string - and $ the end -).


Then if we found a match, we would check the rest digits.
 
E

evaned

What about if you looked for the last two numbers in the string (or
just the second to last really) and sorted on that?

The code would be similar to Ivan's, but going backwards in the string.
(I'm gonna assume you can write the actual code, so I won't.)
 
I

Ivan Vecerina

Nobody said:
This would only work if *any* number was assumed to be the part #. If the
description contained any other number, this would fail.

It depends what is considered as success.
Did I miss something, or is this a new requirement that wasn't in
your initial spec? I don't know, and from what I read in your posts,
I wouldn't be able to say what is a part number and what isn't!

If you want some numbers to be sorted by value, and others
lexicographically,
you obviously need to give the comparison function some knowledge about your
field's formatting.
Just add a criterion to conditionally enable the value-based comparison.

For example, if you only want the second-last number in the string
to be specially treated as a (part) number, you could pre-scan each
string (backwards) to identify the start of that number - and do the
value-based comparison only when you reach that point.

Ivan
 
C

Chris Theis

Nobody said:
This would only work if *any* number was assumed to be the part #. If the
description contained any other number, this would fail.

As you can easily see there is simply no way to create a sort that handles
all kinds of situations without *any* knowledge. This is impossible by the
nature of sorting, which implicitly requires at least some information ;-)
What you could do for example, is to resort to an associative container with
keys (like a map). The keys are filled with the minimum information that you
try to extract from the values and sorting is based on the keys. This way
you do not impose too many restrictive requirements on your clients but,
naturally, the success depends very much on you "information gathering
scheme".

Cheers
Chris
 
G

Guest

Nobody said:
Heres the deal... I have an application where I have a list (as in a Windows
list control, but thats not important) displayed to the user. I sort this
list based on the list controls sort function (again, its not important that
its Windows) which ends up calling a compare function in my code:

int CompareFunc(char* str1, char* str2)
{
}

this function returns -1, 0 or 1 which gets passed on to the internal quick
sort algorithm. No problem, it all works fine.

It works fine, but I would expect it to work for 'char*' types, apparently
they are C strings as you say.
Now I have a user in which this list displays "multi-part" items. You can
guess where this is headed :), the list ends up like this:

Item (1/100)
Item (11/100)
Item (2/100)

I think the problem here is treating the data as C strings. These are not
strings but string representations of complex objects. I don't think you
should compare them as strings; because as you note, it doesn't work for
this type of data.
Now while that is a "correct" string sort, its kind of lame. I could force
the user to zero-pad or zero-pad myself, but both seem kind of hokey as I am
either putting requirements on the user or changing his item text. I'd much
rather end up with:

Item (1/100)
Item (2/100)
.
.
.
Item (11/100)

As it should. Now keep in mind that this could end up in dozens of formats,
brackets, parents, dashes, asterisks, etc or any endless supply of cutesy
characters a user might enter. Even the forward slash may not be the part
separator and there may be stuff after the part #s.

It means that the data will be parsed differently in each case. (Boost's
spirit or some other parser might help.)
I've seen some applications do this in the past, but never saw the source
for them. How can this be sorted properly without requiring the user to
enter it in a very specific format?

The data can be sorted properly by first parsing the input, then creating
the objects, and finaly comparing the objects.

Ali
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top