nested strings

S

sausenet

Could anyone suggest a nice way of nesting strings, as follows?

Given: a list of words sorted first in length order, then
alphabetically:
FORMATION
FORMATIVE
FORMATTER
DEFORMATION
DEFORMATIVE
FORMATIVELY
INFORMATICS
INFORMATION
INFORMATIVE
INFORMATORY
CONFORMATION
MALFORMATION
DEFORMATIONAL
INFORMATIONAL
INFORMATIVELY
UNINFORMATIVE

Desired output: the same set of words showing nested substrings
FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
CONFORMATION MALFORMATION]
FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
PERFORMATIVE UNINFORMATIVE]
FORMATTER
INFORMATICS
INFORMATORY

Note as to form: they'll probably start as a concatenation of the
strings separated by blanks.

Note as to ambiguous case: yes, if the starting list has
AAAB
CAAA
CAAAB
then CAAAB could be deemed a child of AAAB or CAAA or both. Any of
these ways of treating CAAAB would be fine.

As a compromise, sacrificing the sub-nesting feature would be fine.
I.e.,
AAAB
AAABC
AAABCC
becomes
AAAB [AAABC AAABCC]

Thanks for any help or pointers.

Steven Alexander
 
B

Ben Morrow

Quoth (e-mail address removed):
Could anyone suggest a nice way of nesting strings, as follows?

Given: a list of words sorted first in length order, then
alphabetically:
FORMATION
FORMATIVE
FORMATTER
DEFORMATION
DEFORMATIVE
FORMATIVELY
INFORMATICS
INFORMATION
INFORMATIVE
INFORMATORY
CONFORMATION
MALFORMATION
DEFORMATIONAL
INFORMATIONAL
INFORMATIVELY
UNINFORMATIVE

Desired output: the same set of words showing nested substrings
FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
CONFORMATION MALFORMATION]
FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
PERFORMATIVE UNINFORMATIVE]
FORMATTER
INFORMATICS
INFORMATORY

Note as to form: they'll probably start as a concatenation of the
strings separated by blanks.

I would start by maintaining a hash mapping strings to arrayrefs of
'children', using logic something like

use List::Util qw/first/;

if (my $parent = first { $nextword =~ /\Q$_/ } keys %children) {
$children{$first}{$nextword} = undef;
}
elsif (my $child = first { /\Q$nextword/ } keys %children) {
$children{$nextword}{$child} = $children{$child};
}

I haven't tested that, and I don't think it's right, but the idea is to
end up with a structure like

my %children;
%children = (
FORMATION => {
DEFORMATION => $children{DEFORMATION},
INFORMATION => $children{INFORMATION},
},
DEFORMATION => { DEFORMATIONAL => undef },
INFORMATION => { INFORMATIONAL => undef },
);

....nah, that isn't right. I think you need two hashes: one flat, with an
entry for each word, mapping word to parent (if any)

my %parents = (
FORMATION => undef,
DEFORMATION => 'FORMATION',
INFORMATION => 'FORMATION',
DEFORMATIONAL => 'DEFORMATION',
INFORMATIONAL => 'INFORMATION',
);

and one structured as a tree

my %children = (
FORMATION => {
DEFORMATION => {
DEFORMATIONAL => {},
},
INFORMATION => {
INFORMATIONAL => {},
},
},
);

You can then see if a word matches something in the %parents hash; if it
does, walk up the chain until you find the top, and then walk down the
%children hash until you find the point where it fits. For instance, if
you'd sorted 'FORMATION' and 'DEFORMATIONAL' and you read 'DEFORMATION'
you would

* Find that 'FORMATION' was a substring of 'DEFORMATION', and since
'FORMATION' has no parent, start there in %children;

* Find that 'DEFORMATIONAL', one of the children of 'FORMATION', has
a substring 'DEFORMATION', so 'DEFORMATIONAL' (and all it's
children) become a child of 'DEFORMATION', which becomes a child
of 'DEFORMATIONAL's (former) parent.

Try coding something like that, and post if you get stuck.

Ben
 
T

Ted Zlatanov

On Tue, 5 Aug 2008 14:51:27 -0700 (PDT) (e-mail address removed) wrote:

s> Given: a list of words sorted first in length order, then
s> alphabetically:
s> FORMATION
s> FORMATIVE
s> FORMATTER
s> DEFORMATION
s> DEFORMATIVE
s> FORMATIVELY
s> INFORMATICS
s> INFORMATION
s> INFORMATIVE
s> INFORMATORY
s> CONFORMATION
s> MALFORMATION
s> DEFORMATIONAL
s> INFORMATIONAL
s> INFORMATIVELY
s> UNINFORMATIVE

s> Desired output: the same set of words showing nested substrings
s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
s> CONFORMATION MALFORMATION]
s> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
s> PERFORMATIVE UNINFORMATIVE]
s> FORMATTER
s> INFORMATICS
s> INFORMATORY

You probably want a trie.

http://en.wikipedia.org/wiki/Trie

http://search.cpan.org/~avif/Tree-Trie-1.5/Trie.pm

Ted
 
P

Peter J. Holzer

Quoth (e-mail address removed):
Could anyone suggest a nice way of nesting strings, as follows?

Given: a list of words sorted first in length order, then
alphabetically: [...]

Desired output: the same set of words showing nested substrings
FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
CONFORMATION MALFORMATION]
FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
PERFORMATIVE UNINFORMATIVE]
FORMATTER
INFORMATICS
INFORMATORY
[...]
I think you need two hashes: one flat, with an entry for each word,
mapping word to parent (if any)

my %parents = (
FORMATION => undef,
DEFORMATION => 'FORMATION',
INFORMATION => 'FORMATION',
DEFORMATIONAL => 'DEFORMATION',
INFORMATIONAL => 'INFORMATION',
);

and one structured as a tree

my %children = (
FORMATION => {
DEFORMATION => {
DEFORMATIONAL => {},
},
INFORMATION => {
INFORMATIONAL => {},
},
},
);

You don't need the %parents hash. The %children hash together with a
recursive function is sufficient:


#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

my $tree = {};

while (<DATA>) {
chomp;
addword($tree, $_);
}
print Dumper $tree;
exit(0);

sub addword {
my ($tree, $word) = @_;
for (keys %$tree) {
if (index($word, $_) >= 0) {
return addword($tree->{$_}, $word);
}
}
$tree->{$word} = {};
}

__DATA__
FORMATION
FORMATIVE
FORMATTER
DEFORMATION
DEFORMATIVE
FORMATIVELY
INFORMATICS
INFORMATION
INFORMATIVE
INFORMATORY
CONFORMATION
MALFORMATION
DEFORMATIONAL
INFORMATIONAL
INFORMATIVELY
UNINFORMATIVE

Printing the tree structure in the format desired by the OP is left as
an exercise to the reader ;-).

hp
 
P

Peter J. Holzer

a list of words sorted first in length order, then
s> alphabetically: [...]
s> Desired output: the same set of words showing nested substrings
s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
s> CONFORMATION MALFORMATION]
s> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
s> PERFORMATIVE UNINFORMATIVE]
s> FORMATTER
s> INFORMATICS
s> INFORMATORY

You probably want a trie.

Isn't a trie defined as a tree where the key of the parent node is a
prefix of the keys of its children? Here it's an infix.

hp
 
T

Ted Zlatanov

s> Given: a list of words sorted first in length order, then
s> alphabetically:
PJH> [...]
s> Desired output: the same set of words showing nested substrings
s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
s> CONFORMATION MALFORMATION]
s> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
s> PERFORMATIVE UNINFORMATIVE]
s> FORMATTER
s> INFORMATICS
s> INFORMATORY
PJH> Isn't a trie defined as a tree where the key of the parent node is a
PJH> prefix of the keys of its children? Here it's an infix.

The OP said it was OK to have just prefix matching, though infix
matching would be nice.

You can also just iterate over the substrings of the target, e.g. for
INFORMATION you check for I-N-F-O-R-M-A-T-I-O-N and then
N-F-O-R-M-A-T-I-O-N and then F-O-R-M-A-T-I-O-N, etc. in the trie.

Ted
 
P

Peter J. Holzer

s> Given: a list of words sorted first in length order, then
s> alphabetically:
PJH> [...]
s> Desired output: the same set of words showing nested substrings
s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL] [...]
PJH> Isn't a trie defined as a tree where the key of the parent node is a
PJH> prefix of the keys of its children? Here it's an infix.

The OP said it was OK to have just prefix matching, though infix
matching would be nice.

You mean "as a compromise, sacrificing the subnesting would be fine"? I
understood that to mean that flattening the tree to just two levels
would be acceptable, not that prefix matching would be ok.

You can also just iterate over the substrings of the target, e.g. for
INFORMATION you check for I-N-F-O-R-M-A-T-I-O-N and then
N-F-O-R-M-A-T-I-O-N and then F-O-R-M-A-T-I-O-N, etc. in the trie.

It would be quite inefficient, though, since you have to search the trie
for a lot of substrings. I don't see a reason to do that, given that the
correct solution is so easy (see my answer to Ben).

hp
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,048
Latest member
verona

Latest Threads

Top