nested strings

sausenet · Aug 5, 2008

Could anyone suggest a nice way of nesting strings, as follows?

Given: a list of words sorted first in length order, then
alphabetically:
FORMATION
FORMATIVE
FORMATTER
DEFORMATION
DEFORMATIVE
FORMATIVELY
INFORMATICS
INFORMATION
INFORMATIVE
INFORMATORY
CONFORMATION
MALFORMATION
DEFORMATIONAL
INFORMATIONAL
INFORMATIVELY
UNINFORMATIVE

Desired output: the same set of words showing nested substrings
FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
CONFORMATION MALFORMATION]
FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
PERFORMATIVE UNINFORMATIVE]
FORMATTER
INFORMATICS
INFORMATORY

Note as to form: they'll probably start as a concatenation of the
strings separated by blanks.

Note as to ambiguous case: yes, if the starting list has
AAAB
CAAA
CAAAB
then CAAAB could be deemed a child of AAAB or CAAA or both. Any of
these ways of treating CAAAB would be fine.

As a compromise, sacrificing the sub-nesting feature would be fine.
I.e.,
AAAB
AAABC
AAABCC
becomes
AAAB [AAABC AAABCC]

Thanks for any help or pointers.

Steven Alexander

Ben Morrow · Aug 6, 2008

Quoth (e-mail address removed):

Could anyone suggest a nice way of nesting strings, as follows?

Given: a list of words sorted first in length order, then
alphabetically:
FORMATION
FORMATIVE
FORMATTER
DEFORMATION
DEFORMATIVE
FORMATIVELY
INFORMATICS
INFORMATION
INFORMATIVE
INFORMATORY
CONFORMATION
MALFORMATION
DEFORMATIONAL
INFORMATIONAL
INFORMATIVELY
UNINFORMATIVE

Desired output: the same set of words showing nested substrings
FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
CONFORMATION MALFORMATION]
FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
PERFORMATIVE UNINFORMATIVE]
FORMATTER
INFORMATICS
INFORMATORY

Note as to form: they'll probably start as a concatenation of the
strings separated by blanks.

I would start by maintaining a hash mapping strings to arrayrefs of
'children', using logic something like

use List::Util qw/first/;

if (my $parent = first { $nextword =~ /\Q$_/ } keys %children) {
$children{$first}{$nextword} = undef;
}
elsif (my $child = first { /\Q$nextword/ } keys %children) {
$children{$nextword}{$child} = $children{$child};
}

I haven't tested that, and I don't think it's right, but the idea is to
end up with a structure like

my %children;
%children = (
FORMATION => {
DEFORMATION => $children{DEFORMATION},
INFORMATION => $children{INFORMATION},
},
DEFORMATION => { DEFORMATIONAL => undef },
INFORMATION => { INFORMATIONAL => undef },
);

....nah, that isn't right. I think you need two hashes: one flat, with an
entry for each word, mapping word to parent (if any)

my %parents = (
FORMATION => undef,
DEFORMATION => 'FORMATION',
INFORMATION => 'FORMATION',
DEFORMATIONAL => 'DEFORMATION',
INFORMATIONAL => 'INFORMATION',
);

and one structured as a tree

my %children = (
FORMATION => {
DEFORMATION => {
DEFORMATIONAL => {},
},
INFORMATION => {
INFORMATIONAL => {},
},
},
);

You can then see if a word matches something in the %parents hash; if it
does, walk up the chain until you find the top, and then walk down the
%children hash until you find the point where it fits. For instance, if
you'd sorted 'FORMATION' and 'DEFORMATIONAL' and you read 'DEFORMATION'
you would

* Find that 'FORMATION' was a substring of 'DEFORMATION', and since
'FORMATION' has no parent, start there in %children;

* Find that 'DEFORMATIONAL', one of the children of 'FORMATION', has
a substring 'DEFORMATION', so 'DEFORMATIONAL' (and all it's
children) become a child of 'DEFORMATION', which becomes a child
of 'DEFORMATIONAL's (former) parent.

Try coding something like that, and post if you get stuck.

Ben

Ted Zlatanov · Aug 6, 2008

On Tue, 5 Aug 2008 14:51:27 -0700 (PDT) (e-mail address removed) wrote:

s> Given: a list of words sorted first in length order, then
s> alphabetically:
s> FORMATION
s> FORMATIVE
s> FORMATTER
s> DEFORMATION
s> DEFORMATIVE
s> FORMATIVELY
s> INFORMATICS
s> INFORMATION
s> INFORMATIVE
s> INFORMATORY
s> CONFORMATION
s> MALFORMATION
s> DEFORMATIONAL
s> INFORMATIONAL
s> INFORMATIVELY
s> UNINFORMATIVE

s> Desired output: the same set of words showing nested substrings
s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
s> CONFORMATION MALFORMATION]
s> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
s> PERFORMATIVE UNINFORMATIVE]
s> FORMATTER
s> INFORMATICS
s> INFORMATORY

You probably want a trie.

http://en.wikipedia.org/wiki/Trie

http://search.cpan.org/~avif/Tree-Trie-1.5/Trie.pm

Ted

Peter J. Holzer · Aug 7, 2008

Quoth (e-mail address removed):

Could anyone suggest a nice way of nesting strings, as follows?

Given: a list of words sorted first in length order, then
alphabetically: [...]

Desired output: the same set of words showing nested substrings
FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
CONFORMATION MALFORMATION]
FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
PERFORMATIVE UNINFORMATIVE]
FORMATTER
INFORMATICS
INFORMATORY

Click to expand...

[...]
I think you need two hashes: one flat, with an entry for each word,
mapping word to parent (if any)

my %parents = (
FORMATION => undef,
DEFORMATION => 'FORMATION',
INFORMATION => 'FORMATION',
DEFORMATIONAL => 'DEFORMATION',
INFORMATIONAL => 'INFORMATION',
);

and one structured as a tree

my %children = (
FORMATION => {
DEFORMATION => {
DEFORMATIONAL => {},
},
INFORMATION => {
INFORMATIONAL => {},
},
},
);

You don't need the %parents hash. The %children hash together with a
recursive function is sufficient:

#!/usr/bin/perl
use warnings;
use strict;
use Data:

umper;

my $tree = {};

while (<DATA>) {
chomp;
addword($tree, $_);
}
print Dumper $tree;
exit(0);

sub addword {
my ($tree, $word) = @_;
for (keys %$tree) {
if (index($word, $_) >= 0) {
return addword($tree->{$_}, $word);
}
}
$tree->{$word} = {};
}

__DATA__
FORMATION
FORMATIVE
FORMATTER
DEFORMATION
DEFORMATIVE
FORMATIVELY
INFORMATICS
INFORMATION
INFORMATIVE
INFORMATORY
CONFORMATION
MALFORMATION
DEFORMATIONAL
INFORMATIONAL
INFORMATIVELY
UNINFORMATIVE

Printing the tree structure in the format desired by the OP is left as
an exercise to the reader ;-).

hp

Peter J. Holzer · Aug 7, 2008

a list of words sorted first in length order, then
s> alphabetically: [...]
s> Desired output: the same set of words showing nested substrings
s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
s> CONFORMATION MALFORMATION]
s> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
s> PERFORMATIVE UNINFORMATIVE]
s> FORMATTER
s> INFORMATICS
s> INFORMATORY

You probably want a trie.

Isn't a trie defined as a tree where the key of the parent node is a
prefix of the keys of its children? Here it's an infix.

hp

Ted Zlatanov · Aug 7, 2008

s> Given: a list of words sorted first in length order, then
s> alphabetically:
PJH> [...]
s> Desired output: the same set of words showing nested substrings
s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
s> CONFORMATION MALFORMATION]
s> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
s> PERFORMATIVE UNINFORMATIVE]
s> FORMATTER
s> INFORMATICS
s> INFORMATORY
PJH> Isn't a trie defined as a tree where the key of the parent node is a
PJH> prefix of the keys of its children? Here it's an infix.

The OP said it was OK to have just prefix matching, though infix
matching would be nice.

You can also just iterate over the substrings of the target, e.g. for
INFORMATION you check for I-N-F-O-R-M-A-T-I-O-N and then
N-F-O-R-M-A-T-I-O-N and then F-O-R-M-A-T-I-O-N, etc. in the trie.

Ted

Peter J. Holzer · Aug 7, 2008

s> Given: a list of words sorted first in length order, then
s> alphabetically:
PJH> [...]
s> Desired output: the same set of words showing nested substrings
s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL] [...]
PJH> Isn't a trie defined as a tree where the key of the parent node is a
PJH> prefix of the keys of its children? Here it's an infix.

The OP said it was OK to have just prefix matching, though infix
matching would be nice.

You mean "as a compromise, sacrificing the subnesting would be fine"? I
understood that to mean that flattening the tree to just two levels
would be acceptable, not that prefix matching would be ok.

You can also just iterate over the substrings of the target, e.g. for
INFORMATION you check for I-N-F-O-R-M-A-T-I-O-N and then
N-F-O-R-M-A-T-I-O-N and then F-O-R-M-A-T-I-O-N, etc. in the trie.

It would be quite inefficient, though, since you have to search the trie
for a lot of substrings. I don't see a reason to do that, given that the
correct solution is so easy (see my answer to Ben).

hp

word_set = set() def should_preceed_with_an(phrase): first_word =	1	Jan 26, 2013
The devolution of English language and slothful c.l.p behaviors exposed!	50	Jan 24, 2012
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Mar 15, 2008
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Feb 1, 2008
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Jan 15, 2008
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Dec 1, 2007
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Jan 1, 2008
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Sep 1, 2007

nested strings

sausenet

Ben Morrow

Ted Zlatanov

Peter J. Holzer

Peter J. Holzer

Ted Zlatanov

Peter J. Holzer

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads