mixed cmp operator for sorting

M

Marc Girod

Hi,

I often have to sort strings with embedded numbers, which gets into the problem that neither cmp nor <=> is appropriate to deal with them.
Making ad-hoc sort functions is nearly trivial, but cumbersome e.g. for one-liners.

I tried thus to write a sort function which would parse out the string parts from the numeric ones.
Here is what I came up with.

-8<---------------
package UCmp;

require Exporter;
@ISA = qw(Exporter);
@EXPORT = qw(ucmp);

sub ucmp {
my $a = eval '$' . caller . '::a';
my $b = eval '$' . caller . '::b';
my @t = $a =~ /(\D*)(\d*)/g; pop @t; pop @t;
my @s = $b =~ /(\D*)(\d*)/g; pop @s; pop @s;
my $l = ((@t <= @s)? @t : @s) / 2;
my $ret;
while ($l--) {
$ret = ((shift @t) cmp (shift @s) or (shift @t or 0) <=> (shift @s or 0));
$l = 0 if $ret;
}
return ($ret or @t <=> @s);
}

1;
-8<-----------------

And a test example:

$ cat bar
#!/usr/bin/perl -w

use feature qw(say);
use UCmp;

say for sort ucmp qw(
a12b34
a2c
b23
a7
a7b
23
);
$ ./bar
23
a2c
a7
a7b
a12b34
b23

Would you care to criticize it?
Thanks,
Marc
 
M

Marc Girod

You would be better off prototyping ucmp ($$), which makes sort pass the
values to be sorted in @_. This is slightly slower than using $a and $b,
which is why it isn't the default, but using eval will be *much* slower,
as well as rather less safe.

OK. I knew of this option, but I wanted to learn about the pros and cons...
If you must access your caller's variables, you should use symrefs in
preference to eval:

my ($a, $b) = do {
no strict "refs";
my $pkg = caller;
${"$pkg\::a"}, ${"$pkg\::b"};
};

Thank you! You remind me of an advice you already gave me previously in another context!
Now, how would this compare with prototyping the function?
Why the pops?

I found that since my regexp matches an empty string, the 'g' flag will stop after matching one (which is very nice of it), leaving me with one pair of empty strings in every case...
I am just removing them.
You try string comparison first; this means numerical sections will be
sorted in string order and the numerical comparison is pointless.

No? I try it on the string token, possibly empty.
You need to consider what to do if the sections you are comparing are of
different types (a numeric vs a non-numeric section).

My regexp matching has taken care of this...
You should use || for or-as-an-expression; 'or' is for flow control.

OK. The only result there would be to change the precedence, allowing me to drop one set of parentheses...?
I don't understand what $l is doing for you here. AFAICS these two
strings

will compare equal; is that what you want?

$l is initialised to the length of the shorter string (in token pairs).
Your two strings do sort right...

$ perl -MUCmp -lE 'print for sort ucmp qw(a1b2z9 a1c3z9)'
a1b2z9
a1c3z9

The first pass of the loop will handle the two (a, 1); they compare doubly equal, but since $l is not down to 0 yet, the next pass will handle (b, 2) and (c, 3), and there the comparison of b and c will suffice:
$l will be forcibly set to 0, and since $ret is non null, the original lengths will not need to be compared.

Marc
 
M

Marc Girod

You'd have to benchmark it to see, but I would expect that all using $a
and $b saves is the cost of pushing the values on the stack and popping
them off again, which is exactly what that 'do' does. So I would expect
this to be (slightly) slower than the prototype.

Fair enough, and understood.
Ah yes; I missed that this could match the empty string. I might rather
modify the pattern so it doesn't, though that's not terribly
straightforward; something like

/(?=.)(\D*)(\d*)/gs

is probably as simple as it will go.

OK. I see this is better.
Why do you use the 's' modifier?
To sort multiline items?...
In that case: there's no need to count the loop, since Perl arrays know
how long they are:

while (@s && @t) {
$ret = ...
and last;
}

OK. This one I get.
my $ret = ...;
$ret and return $ret;

But this one I don't...
I have now:

return $ret || @t said:
(once again I ask myself why 'my' variables aren't introduced until the
end of the statement...)

And neither this comment...
Thanks!
Marc
 
U

Uri Guttman

MG> Hi, I often have to sort strings with embedded numbers, which gets
MG> into the problem that neither cmp nor <=> is appropriate to deal
MG> with them. Making ad-hoc sort functions is nearly trivial, but
MG> cumbersome e.g. for one-liners.

have you looked at Sort::Maker on cpan? you can make the sort elsewhere
and call the single code ref to sort stuff. you can even run it offline
and get the source for the sorter to use in your code.

MG> sub ucmp {
MG> my $a = eval '$' . caller . '::a';
MG> my $b = eval '$' . caller . '::b';
MG> my @t = $a =~ /(\D*)(\d*)/g; pop @t; pop @t;
MG> my @s = $b =~ /(\D*)(\d*)/g; pop @s; pop @s;

that is redundant and prone to error. that is one area sort::maker
helps. you only need to describe how to extract each key from the
elements to be sorted. the module does the rest.

uri
 
$

$Bill

Hi,

I often have to sort strings with embedded numbers, which gets into the problem that neither cmp nor <=> is appropriate to deal with them.
Making ad-hoc sort functions is nearly trivial, but cumbersome e.g. for one-liners.

I tried thus to write a sort function which would parse out the string parts from the numeric ones.
Here is what I came up with. ....
sub ucmp {
my $a = eval '$' . caller . '::a';
my $b = eval '$' . caller . '::b';
my @t = $a =~ /(\D*)(\d*)/g; pop @t; pop @t;
my @s = $b =~ /(\D*)(\d*)/g; pop @s; pop @s;
my $l = ((@t <= @s)? @t : @s) / 2;
my $ret;
while ($l--) {
$ret = ((shift @t) cmp (shift @s) or (shift @t or 0) <=> (shift @s or 0));

I might use:
$ret = ((shift @t || '') cmp (shift @s || '') or (shift @t || 0) <=> (shift @s || 0));
or:
$ret = (shift @t || '') cmp (shift @s || '');
$ret = (shift @t || 0) said:
$l = 0 if $ret;

or:
last if $ret;
}
return ($ret or @t <=> @s);
}

Interesting sub - not sure where it would come in handy though
with that specific kind of data string(s) [mixed alpha/numeric].
 
T

Tim McDaniel

say for sort ucmp qw(
a12b34
a2c
b23
a7
a7b
23
);
$ ./bar
23
a2c
a7
a7b
a12b34
b23

Would you care to criticize it?

You don't document what the result should be. I can make a guess from
that sample data, but I'd rather not guess; I'd rather have a spec
to verify results.
 
J

John W. Krahn

Marc said:
Hi,

I often have to sort strings with embedded numbers, which gets into the problem that neither cmp nor<=> is appropriate to deal with them.
Making ad-hoc sort functions is nearly trivial, but cumbersome e.g. for one-liners.

I tried thus to write a sort function which would parse out the string parts from the numeric ones.
Here is what I came up with.

-8<---------------
package UCmp;

require Exporter;
@ISA = qw(Exporter);
@EXPORT = qw(ucmp);

sub ucmp {
my $a = eval '$' . caller . '::a';
my $b = eval '$' . caller . '::b';
my @t = $a =~ /(\D*)(\d*)/g; pop @t; pop @t;
my @s = $b =~ /(\D*)(\d*)/g; pop @s; pop @s;

Better as:

my @t = $a =~ /\D+|\d+/g;
my @s = $b =~ /\D+|\d+/g;

Avoids zero length strings.



John
 
C

Charles DeRykus

...

I would like to be able to write

my $ret = ...
and return $ret;

but I can't, because Perl does not let you refer to a variable in the
same statement as it was declared. This restriction is entirely
artificial (it would be easier for perl to bring $ret into scope
immediately) and I consider it unnecessary and unhelpful, for reasons
I've explained before and so won't bore people by explaining again.

Of course it currently works fine with a previous $ret in scope.
There's no error but the sub returns "foo" instead of "bar":

{ my $ret="foo"; sub {my $ret="bar" and return $ret} }

A side-benefit would be this scope surprise would be eliminated because
the later $ret would be seen.

But, there's always some catch: you'd still need: no warnings 'syntax'
though in order to quiet: "Found = in conditional, should be ==".
 
R

Rainer Weikusat

[...]
Maybe I should write the whole thing out:

while (@s && @t) {
my $ret = ...;
$ret and return $ret;
}
return @t <=> @s;

Rather than using 'last' to get out of the loop and using $ret to carry
the return value out of the loop, simply return immediately from inside
the loop. Then, if the loop ends and the sub hasn't returned yet, do the
final comparison. This avoids having to have $ret's scope extend over
every iteration of the loop and beyond, which I consider a Good Thing.
(Limiting the scope of variables is always a Good Thing.)

Subroutines which are so complicated that people start to loose track of
which variables were being used in what places are 'always a bad thing'
and this solution is not to create more variables but the split the
sequential tapeworm into multiple subroutines (with individual sets of
variables)[*].

[*] I keep being amazed at this idea that the solution to "It is too
complicated already!" must be "Make it more complicated!" (and maybe,
add some neat warnings to help the poor guy who can stop out-clevering
himself in code ...).
I would like to be able to write

my $ret = ...
and return $ret;

but I can't, because Perl does not let you refer to a variable in the
same statement as it was declared. This restriction is entirely
artificial (it would be easier for perl to bring $ret into scope
immediately) and I consider it unnecessary and unhelpful, for reasons
I've explained before and so won't bore people by explaining again.

The reason for this is that code like this

my $v = 12;
{
my $v = $v;
print("$v\n");
}

works as intended, ie, the inner $v gets initialized to the value of the
outer $v. Granted, this is a silly example, but I've already used
similar constructs in real code.
 
M

Marc Girod

/s makes . match newline. Without it the pattern will fail to match if a
non-digit section starts with a newline.
Indeed.

Maybe I should write the whole thing out:
Thanks.

I would like to be able to write ....
I've explained before and so won't bore people by explaining again.

Thanks again.

Marc
 
T

Tim McDaniel

I would like to be able to write

my $ret = ...
and return $ret;

but I can't, because Perl does not let you refer to a variable in the
same statement as it was declared. This restriction is entirely
artificial (it would be easier for perl to bring $ret into scope
immediately)

One problem I see is with "immediately" -- with order of operations.
In the C standard, as I recall, to allow for varying orders of
execution, they had to invent the concept of "sequence point" for "at
which it is guaranteed that all side effects of previous evaluations
will have been performed, and no side effects from subsequent
evaluations have yet been performed" (to quote Wikipedia).

I don't know whether Perl always evaluates left-to-right bottom-up, or
whether anything guarantees that -- other than the things that must be
LTRBU, like ||, &&, ?:, et cetera, or for things that man perlop says
are undefined (the exact time of execution of the increment/decrement
of postfix ++ or --, or the results of << overflowing an integer).

If it is well-specified, then the order of operations is known:
$res = (my $i = 23) * f($i);
would call f with 23, but in
$res = f($i) * (my $i = 23);
then the argument of $i would not be that "my $i".

I also wonder if there's a way to leverage ++, --, or << to get
unexpected behavior, or other operators. I can't think of anything.
I thought about conditionals, but "my $x = value() if 0" is already
known to be hinky.
 
T

Tim McDaniel

Quoth (e-mail address removed):

Variable introduction happens at compile time, so evaluation order is
irrelevant. The variable should be visible to all code textually after
the my operator.

OK, "order of operations" more generally.
and perhaps-less-obviously-stupid constructions like

my ($i, $j) = (1, $i + 1);

Or

my ($i, $j) = ($j, $i);

?

There's an easy workaround (predeclare), and the current rule is
simply (in scope at end of statement). And I don't like "don't do
that but we don't or can't catch it" pitfalls, of which Perl has far
too many.
 
M

Marc Girod

What is 20% of what? What is 5% of what, and what makes you sure it's
too expensive? Are you benchmarking a sort sub which does actual work,
because I would be extremely surprised if the differences were visible
in that case.

Sorry. You are right.
I made now 4 versions of the function:

ucmp1 is the one using 'do' to initialize my ($a, $b).
ucmp is the prototyped version, getting its arguments from the stack.
ucmp2 is an attempt to avoid 'do's usage of the stack, skipping the intermediate variables:
my @t = ${"${pkg}::a"} =~ /(?=.)(\D*)(\d*)/gs;
mkcmp is an attempt to use a closure to avoid evaluating 'caller' every time.

I time with the original 6 item data:

my @data = qw( a12b34 a2c b23 a7 a7b 23 );
cmpthese(100000, {
'ucmp' => sub {1 for sort ucmp @data},
'ucmp1' => sub {1 for sort ucmp1 @data},
'ucmp2' => sub {1 for sort ucmp2 @data},
'mkcmp' => sub {my $cmp = mkcmp; 1 for sort $cmp @data}
});

The result is:

sort> ./cmpcmp
Rate ucmp1 mkcmp ucmp2 ucmp
ucmp1 9747/s -- -1% -6% -13%
mkcmp 9881/s 1% -- -4% -12%
ucmp2 10331/s 6% 5% -- -8%
ucmp 11173/s 15% 13% 8% --

So, the version using a prototype is about 15% (I got 20 yesterday) faster than the one using 'do'.
Avoiding stack manipulation is only 6% faster.
I guessed invoking 'caller' was expensive, but using a closure to avoid it involves something even more expensive (so only 1% faster than the slowest).

Of course, using different data would impact the results.

Marc
 
$

$Bill

Sorry. You are right.
I made now 4 versions of the function:

What happens if you just change the package name to main and drop
the initialization of $a/$b in the do and just let them come in as
defined variables ? Wouldn't that give you a little quicker version ?
 
P

Peter J. Holzer

I did a little testing, and once you've moved away from a plain block
the differences are pretty small and almost certainly not worth worrying
about. Anything you do by way of mucking about with caller or eval will
come out slower than using the prototype and @_.

use Benchmark "cmpthese";

my @d = qw/2 3 1 2 5 6 4/;

sub ab { $a + 1 <=> $b + 1 }
sub args ($$) { $_[0] + 1 <=> $_[1] + 1 }
my $ab = sub { $a + 1 <=> $b + 1 };
my $args = sub ($$) { $_[0] + 1 <=> $_[1] + 1 };

cmpthese -5, {
block => sub { 1 for sort { $a + 1 <=> $b + 1 } @d },
ab => sub { 1 for sort ab @d },
args => sub { 1 for sort args @d },
anonab => sub { 1 for sort $ab @d },
anonargs => sub { 1 for sort $args @d },
};

Rate ab args anonargs anonab block
ab 269915/s -- -0% -5% -6% -11%
args 270042/s 0% -- -5% -6% -11%
anonargs 283833/s 5% 5% -- -2% -6%
anonab 288225/s 7% 7% 2% -- -5%
block 302222/s 12% 12% 6% 5% --

Note that on that run $a/$b was slower than @_ in both cases; in my
tests the order of those two tends to switch randomly.

Interesting. On my machine anonargs is consistently the fastest, ab the
slowest, with the others somewhere in between (and sometimes switching
places).

But the picture changes for larger arrays (e.g. 100 or 1000 elements):

Then args and anonargs are the slowest, ab and anonab are about 10%
faster and block is about 17% faster.

Looks like the args variants have the lowest setup cost but the highest
per call cost, so they are better for (very) small arrays, while ab and
especially block are better for larger arrays.

I also find it interesting that tha anon variants seem to be a bit
faster than their counterparts, and that this effect is more pronounced
for smaller arrays. Is there some setup cost associated with a normal
sub that an anonymous sub doesn't have?

hp
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top