domain name ordering

I

Ivan Shmakov

One more "domain name" problem: write an "inequality" predicate
to compare domain names "right to left". IOW:

my @ordered
= qw (qux.example.net bar.example.org foo.bar.example.org foo.example.org);

A trivial solution is to split /\./ each of the names, reverse
the resulting lists, and then compare them elementwise, until
either one of the lists ends (which is then the lesser one), or
an inequality is found. Hence:

sub list_cmp (&$$) {
## BTW, how do I imitate sort's own signature here?
## (i. e., make $cmp truly optional.)
my ($cmp, $a, $b) = @_;
$cmp
//= sub { $_[0] cmp $_[1]; };
for (my $i = 0; $i <= $#$a; $i++) {
## .
return 1
if ($i > $#$b);
my $v
= &$cmp ($a->[$i], $b->[$i]);
## .
return $v
if ($v != 0);
}

## .
return ($#$a < $#$b ? -1 : 0);
}

sub dns_name_cmp ($$) {
my ($a, $b) = @_;

## .
list_cmp (undef,
[reverse (split (/\./, $a, -1))],
[reverse (split (/\./, $b, -1))]);
}

## print in reverse
print (join ("\n", sort { - dns_name_cmp ($a, $b); } (@ordered)), "\n");
# foo.example.org
# foo.bar.example.org
# bar.example.org
# qux.example.net

However, doing a split on every sort iteration doesn't seem all
that sensible (or does it?) Let's try with rindex () instead:

sub dns_name_cmp ($$) {
my ($a, $b) = @_;

my ($ta, $tb)
= (-1 + length ($a), -1 + length ($b));
while ($ta >= 0 && $tb >= 0) {
## NB: $[ is deprecated, thus rindex () >= -1
my ($i, $j)
= (rindex ($a, ".", $ta),
rindex ($b, ".", $tb));
## .
my $v
= (substr ($a, 1 + $i, $ta - $i)
cmp substr ($b, 1 + $j, $tb - $j));
return $v
if ($v != 0);
($ta, $tb)
= (-1 + $i, -1 + $j);
}

## .
return ($ta > 0 ? +1
: $tb > 0 ? -1
: 0);
}

## print in reverse
print (join ("\n", sort { - dns_name_cmp ($a, $b); } (@ordered)), "\n");
# foo.example.org
# foo.bar.example.org
# bar.example.org
# qux.example.net

Hopefully I didn't miss any corner case with these.

Now, it makes me wonder if there's an easier way to write this
function...
 
D

Dr.Ruud

One more "domain name" problem: write an "inequality" predicate
to compare domain names "right to left". IOW:

my @ordered
= qw (qux.example.net bar.example.org foo.bar.example.org foo.example.org);

You can build a hash from that, for example:

my %dname = (
net => {
example => { qux => { '.' => -1 } },
},
org => {
example => {
bar => {
foo => { '.' => -1 },
'.' => -1,
},
foo => { '.' => -1 },
},
},
);



Or do
my @dnames= sort { length($b) <=> length($a) or $a cmp $b } @ordered;
to make it:
qw(foo.bar.x.org bar.x.org foo.x.org qux.x.net);
and use a 'for my $dname (@dnames) { ... } loop
that stops at the first match on
/(?:^|\.)\Q$dname$/
 
B

Bo Lindbergh

Transform each FQDN into something that will sort correctly
using the default string comparison; sort; reverse the transformation.

@ordered = map {
join ".", reverse split / /;
} sort map {
join " ", reverse split /\./;
} @unordered;


/Bo Lindbergh
 
R

Rainer Weikusat

Ivan Shmakov said:
One more "domain name" problem: write an "inequality" predicate
to compare domain names "right to left". IOW:

my @ordered
= qw (qux.example.net bar.example.org foo.bar.example.org foo.example.org);

A trivial solution is to split /\./ each of the names, reverse
the resulting lists, and then compare them elementwise, until
either one of the lists ends (which is then the lesser one), or
an inequality is found. Hence:

sub list_cmp (&$$) {
[...]

However, doing a split on every sort iteration doesn't seem all
that sensible (or does it?) Let's try with rindex () instead:


sub dns_name_cmp ($$) {
my ($a, $b) = @_;

my ($ta, $tb)
= (-1 + length ($a), -1 + length ($b));

-1 + length($a) == length($a) - 1
while ($ta >= 0 && $tb >= 0) {
## NB: $[ is deprecated, thus rindex () >= -1
my ($i, $j)
= (rindex ($a, ".", $ta),
rindex ($b, ".", $tb));
## .
my $v
= (substr ($a, 1 + $i, $ta - $i)
cmp substr ($b, 1 + $j, $tb - $j));

Have you benchmarked that? You're still doing the structure analysis
and component extraction for ever comparison and out of my head, I
wouldn't be absolutely sure that doing it this way instead of with
split is actually faster.

The obvious other idea would be to split the domain names once and
leave them this way until the list is sorted. If you could afford to
lose some speed at the expense of code clarity, this could be put in a
class overloading cmp which either constructs the 'non-splitted,
non-reversed' representation 'on demand' (=> overloaded "") or keeps
both around or constructs the 'exploded' one on demand.

Instead of reversing the splitted lists, one could also compare from
the end, possibly using negative indices ($a[$#a] is the same as $a[-1]).
 
R

Rainer Weikusat

Bo Lindbergh said:
Transform each FQDN into something that will sort correctly
using the default string comparison; sort; reverse the transformation.

@ordered = map {
join ".", reverse split / /;
} sort map {
join " ", reverse split /\./;
} @unordered;

This is a neat idea. It could be made slightly more general by using
\000 instead of the space as alternate label separator.
 
R

Rainer Weikusat

Rainer Weikusat said:
This is a neat idea. It could be made slightly more general by using
\000 instead of the space as alternate label separator.

A 'better' ('faster') idea which needs even less Perl code would be to
build a hash of sort keys and sort the domain name array based on
comparing these instead of destroying and recreating the dataset which
is supposed to be sorted:

------------------
use Benchmark;

my @in = reverse(qw(qux.example.net bar.example.org foo.bar.example.org foo.example.org));

timethese(-5,
{
sort_x => sub {
return map {
join '.', reverse split /\000/
} sort map {
join "\000", reverse split /\./
} @in;
},

sort_y => sub {
my %keys;

$keys{$_} = join("\000", reverse(split(/\./, $_))) for @in;
return sort { $keys{$a} cmp $keys{$b} } @in;
}});
 
I

Ivan Shmakov

[...]
A 'better' ('faster') idea which needs even less Perl code would be
to build a hash of sort keys and sort the domain name array based on
comparing these instead of destroying and recreating the dataset
which is supposed to be sorted:

[...]

Unless I be mistaken, sort { } () with the second variant of
dns_name_cmp I've originally posted is still faster.

$ cat < 3zytr64ehn1aqckj5mk93da6zg.perl
use common::sense;

require Benchmark;

## Courtesy of /usr/share/dict/words
my @n1 = qw {
igloo.burgs.tuft leads.polyp.lakes doily.rude.corks rains.cuds.bilks
aces.mumps.colts pink.bebop.prune coeds.troop.sees tour.hulks.rungs
vent.dial.whelk logos.loped.nor mated.bran.cited pies.loaf.rune
sperm.yelp.harms how.scaly.ante dogma.lawns.plods dell.yaw.piano
caged.daub.gores weird.sofas.arias rave.gouty.tamp sics.avows.today
crass.execs.smuts dale.real.stack boots.dew.pool demur.fancy.stoop
gruff.aspen.memo gaze.homey.sorts new.slot.queue wiry.share.gulf
phase.canny.vogue grace.eddy.beaux juts.wands.doc helm.was.ladle
};
my @n2 = qw {
bums.example.com fools.example.com sic.example.org feel.example.com
wound.example.net stop.example.com diced.example.org
purse.example.com sip.example.net poll.example.com whole.example.net
kinky.example.net homer.example.org ulnas.example.com
won.example.org chase.example.com prong.example.net
hacks.example.net link.example.com monk.example.com sexy.example.net
bogs.example.net crest.example.org rays.example.com
solar.example.com click.example.net mixer.example.com
rib.example.org pinup.example.com gees.example.net fizzy.example.com
aisle.example.org
};

## sub dns_name_cmp { ... as per ... }

sub rw_sort (@) {
my %keys;

$keys{$_}
= join ("\000", reverse (split (/\./, $_)))
for @_;
## .
return
sort { $keys{$a} cmp $keys{$b} } @_;
}

Benchmark::timethese (1024 * 1024, {
"is-1" => sub { sort dns_name_cmp (@n1); },
"is-2" => sub { sort dns_name_cmp (@n2); },
"rw-1" => sub { rw_sort (@n1); },
"rw-2" => sub { rw_sort (@n1); }
});
$ perl 3zytr64ehn1aqckj5mk93da6zg.perl
Benchmark: timing 1048576 iterations of is-1, is-2, rw-1, rw-2...
is-1: -1 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 34952533.33/s (n=1048576)
(warning: too few iterations for a reliable count)
is-2: -1 wallclock secs ( 0.16 usr + 0.00 sys = 0.16 CPU) @ 6553600.00/s (n=1048576)
(warning: too few iterations for a reliable count)
rw-1: 69 wallclock secs (69.14 usr + 0.03 sys = 69.17 CPU) @ 15159.40/s (n=1048576)
rw-2: 69 wallclock secs (68.61 usr + 0.02 sys = 68.63 CPU) @ 15278.68/s (n=1048576)
$
 
R

Rainer Weikusat

Ben Morrow said:
Quoth Rainer Weikusat said:
Rainer Weikusat said:
Transform each FQDN into something that will sort correctly
using the default string comparison; sort; reverse the transformation.

@ordered = map {
join ".", reverse split / /;
} sort map {
join " ", reverse split /\./;
} @unordered;

This is a neat idea. It could be made slightly more general by using
\000 instead of the space as alternate label separator.

A 'better' ('faster') idea which needs even less Perl code would be to
build a hash of sort keys and sort the domain name array based on
comparing these instead of destroying and recreating the dataset which
is supposed to be sorted: [...]
my %keys;

$keys{$_} = join("\000", reverse(split(/\./, $_))) for @in;
return sort { $keys{$a} cmp $keys{$b} } @in;

Or just use a standard ST

map $$_[0],
sort { $$a[1] cmp $$b[1] }
map [ $_, join "\0", reverse split /\./ ],
@in;

I tried that. It is/ was slower for the data I tested with than the
algorithm originally posted by Bo Lindbergh.
 
R

Rainer Weikusat

Ivan Shmakov said:
Rainer Weikusat <[email protected]> writes:
[...]

A 'better' ('faster') idea which needs even less Perl code would be
to build a hash of sort keys and sort the domain name array based on
comparing these instead of destroying and recreating the dataset
which is supposed to be sorted:

[...]

Unless I be mistaken, sort { } () with the second variant of
dns_name_cmp I've originally posted is still faster.
[...]

sub rw_sort (@) {
my %keys;

$keys{$_}
= join ("\000", reverse (split (/\./, $_)))
for @_;
## .
return
sort { $keys{$a} cmp $keys{$b} } @_;
}

Benchmark::timethese (1024 * 1024, {
"is-1" => sub { sort dns_name_cmp (@n1); },
"is-2" => sub { sort dns_name_cmp (@n2); },
"rw-1" => sub { rw_sort (@n1); },
"rw-2" => sub { rw_sort (@n1); }
});

The obvious issue with this would be that you're adding a gratuitious
subroutine call per iteration for the 3rd and 4th test. But that
doesn't matter that much: Comparing the labels via substring and
stopping at the first mismatch is several orders of magnitude faster
then processing the complete data once per sort and sorting based on
the 'preprocessed' data.
 
R

Rainer Weikusat

[...]
Actually, now that I think about it, there's another way of doing this I
haven't seen anywhere else: use dualvars.

use Scalar::Util "dualvar";

sub ksort (&@) {
my $cb = shift;

map $_[$_],
sort
map dualvar($_, do {
local $_ = $_[$_];
$cb->();
}),
0..$#_;
}

my @out = ksort { join "\0", reverse split /\./ } @in;

This is a neat idea as well. I thought about attaching the indices of
the actual data items to the corresponding sort keys and generating
the sorted sequence by pulling data items from the input array based
on the index permutation in the sorted, intermediate array 'somehow'
but abandoned the idea in favor of using anonymous arrays to tack sort
key and data item together (I knew that something called 'a
Schwartzian transform' existed but had not real idea what it actually
was) because I wasn't aware of a sensible way to accomplish the other.
 
J

John W. Krahn

Ivan said:
One more "domain name" problem: write an "inequality" predicate
to compare domain names "right to left". IOW:

my @ordered
= qw (qux.example.net bar.example.org foo.bar.example.org foo.example.org);

A trivial solution is to split /\./ each of the names, reverse
the resulting lists, and then compare them elementwise, until
either one of the lists ends (which is then the lesser one), or
an inequality is found. Hence:

sub list_cmp (&$$) {
## BTW, how do I imitate sort's own signature here?
## (i. e., make $cmp truly optional.)

You mean prototype, not signature.

$ perl -le'print prototype "CORE::sort"'


But sort doesn't have one so you can't imitate something that is not there.



John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,764
Messages
2,569,567
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top