nested strings

Discussion in 'Perl Misc' started by sausenet@gmail.com, Aug 5, 2008.

  1. Guest

    Could anyone suggest a nice way of nesting strings, as follows?

    Given: a list of words sorted first in length order, then
    alphabetically:
    FORMATION
    FORMATIVE
    FORMATTER
    DEFORMATION
    DEFORMATIVE
    FORMATIVELY
    INFORMATICS
    INFORMATION
    INFORMATIVE
    INFORMATORY
    CONFORMATION
    MALFORMATION
    DEFORMATIONAL
    INFORMATIONAL
    INFORMATIVELY
    UNINFORMATIVE

    Desired output: the same set of words showing nested substrings
    FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
    CONFORMATION MALFORMATION]
    FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
    PERFORMATIVE UNINFORMATIVE]
    FORMATTER
    INFORMATICS
    INFORMATORY

    Note as to form: they'll probably start as a concatenation of the
    strings separated by blanks.

    Note as to ambiguous case: yes, if the starting list has
    AAAB
    CAAA
    CAAAB
    then CAAAB could be deemed a child of AAAB or CAAA or both. Any of
    these ways of treating CAAAB would be fine.

    As a compromise, sacrificing the sub-nesting feature would be fine.
    I.e.,
    AAAB
    AAABC
    AAABCC
    becomes
    AAAB [AAABC AAABCC]

    Thanks for any help or pointers.

    Steven Alexander
     
    , Aug 5, 2008
    #1
    1. Advertising

  2. Ben Morrow Guest

    Quoth :
    > Could anyone suggest a nice way of nesting strings, as follows?
    >
    > Given: a list of words sorted first in length order, then
    > alphabetically:
    > FORMATION
    > FORMATIVE
    > FORMATTER
    > DEFORMATION
    > DEFORMATIVE
    > FORMATIVELY
    > INFORMATICS
    > INFORMATION
    > INFORMATIVE
    > INFORMATORY
    > CONFORMATION
    > MALFORMATION
    > DEFORMATIONAL
    > INFORMATIONAL
    > INFORMATIVELY
    > UNINFORMATIVE
    >
    > Desired output: the same set of words showing nested substrings
    > FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
    > CONFORMATION MALFORMATION]
    > FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
    > PERFORMATIVE UNINFORMATIVE]
    > FORMATTER
    > INFORMATICS
    > INFORMATORY
    >
    > Note as to form: they'll probably start as a concatenation of the
    > strings separated by blanks.


    I would start by maintaining a hash mapping strings to arrayrefs of
    'children', using logic something like

    use List::Util qw/first/;

    if (my $parent = first { $nextword =~ /\Q$_/ } keys %children) {
    $children{$first}{$nextword} = undef;
    }
    elsif (my $child = first { /\Q$nextword/ } keys %children) {
    $children{$nextword}{$child} = $children{$child};
    }

    I haven't tested that, and I don't think it's right, but the idea is to
    end up with a structure like

    my %children;
    %children = (
    FORMATION => {
    DEFORMATION => $children{DEFORMATION},
    INFORMATION => $children{INFORMATION},
    },
    DEFORMATION => { DEFORMATIONAL => undef },
    INFORMATION => { INFORMATIONAL => undef },
    );

    ....nah, that isn't right. I think you need two hashes: one flat, with an
    entry for each word, mapping word to parent (if any)

    my %parents = (
    FORMATION => undef,
    DEFORMATION => 'FORMATION',
    INFORMATION => 'FORMATION',
    DEFORMATIONAL => 'DEFORMATION',
    INFORMATIONAL => 'INFORMATION',
    );

    and one structured as a tree

    my %children = (
    FORMATION => {
    DEFORMATION => {
    DEFORMATIONAL => {},
    },
    INFORMATION => {
    INFORMATIONAL => {},
    },
    },
    );

    You can then see if a word matches something in the %parents hash; if it
    does, walk up the chain until you find the top, and then walk down the
    %children hash until you find the point where it fits. For instance, if
    you'd sorted 'FORMATION' and 'DEFORMATIONAL' and you read 'DEFORMATION'
    you would

    * Find that 'FORMATION' was a substring of 'DEFORMATION', and since
    'FORMATION' has no parent, start there in %children;

    * Find that 'DEFORMATIONAL', one of the children of 'FORMATION', has
    a substring 'DEFORMATION', so 'DEFORMATIONAL' (and all it's
    children) become a child of 'DEFORMATION', which becomes a child
    of 'DEFORMATIONAL's (former) parent.

    Try coding something like that, and post if you get stuck.

    Ben

    --
    Raise your hand if you're invulnerable.
    []
     
    Ben Morrow, Aug 6, 2008
    #2
    1. Advertising

  3. Ted Zlatanov Guest

    On Tue, 5 Aug 2008 14:51:27 -0700 (PDT) wrote:

    s> Given: a list of words sorted first in length order, then
    s> alphabetically:
    s> FORMATION
    s> FORMATIVE
    s> FORMATTER
    s> DEFORMATION
    s> DEFORMATIVE
    s> FORMATIVELY
    s> INFORMATICS
    s> INFORMATION
    s> INFORMATIVE
    s> INFORMATORY
    s> CONFORMATION
    s> MALFORMATION
    s> DEFORMATIONAL
    s> INFORMATIONAL
    s> INFORMATIVELY
    s> UNINFORMATIVE

    s> Desired output: the same set of words showing nested substrings
    s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
    s> CONFORMATION MALFORMATION]
    s> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
    s> PERFORMATIVE UNINFORMATIVE]
    s> FORMATTER
    s> INFORMATICS
    s> INFORMATORY

    You probably want a trie.

    http://en.wikipedia.org/wiki/Trie

    http://search.cpan.org/~avif/Tree-Trie-1.5/Trie.pm

    Ted
     
    Ted Zlatanov, Aug 6, 2008
    #3
  4. On 2008-08-05 23:42, Ben Morrow <> wrote:
    > Quoth :
    >> Could anyone suggest a nice way of nesting strings, as follows?
    >>
    >> Given: a list of words sorted first in length order, then
    >> alphabetically:

    [...]
    >>
    >> Desired output: the same set of words showing nested substrings
    >> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
    >> CONFORMATION MALFORMATION]
    >> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
    >> PERFORMATIVE UNINFORMATIVE]
    >> FORMATTER
    >> INFORMATICS
    >> INFORMATORY

    [...]
    > I think you need two hashes: one flat, with an entry for each word,
    > mapping word to parent (if any)
    >
    > my %parents = (
    > FORMATION => undef,
    > DEFORMATION => 'FORMATION',
    > INFORMATION => 'FORMATION',
    > DEFORMATIONAL => 'DEFORMATION',
    > INFORMATIONAL => 'INFORMATION',
    > );
    >
    > and one structured as a tree
    >
    > my %children = (
    > FORMATION => {
    > DEFORMATION => {
    > DEFORMATIONAL => {},
    > },
    > INFORMATION => {
    > INFORMATIONAL => {},
    > },
    > },
    > );


    You don't need the %parents hash. The %children hash together with a
    recursive function is sufficient:


    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;

    my $tree = {};

    while (<DATA>) {
    chomp;
    addword($tree, $_);
    }
    print Dumper $tree;
    exit(0);

    sub addword {
    my ($tree, $word) = @_;
    for (keys %$tree) {
    if (index($word, $_) >= 0) {
    return addword($tree->{$_}, $word);
    }
    }
    $tree->{$word} = {};
    }

    __DATA__
    FORMATION
    FORMATIVE
    FORMATTER
    DEFORMATION
    DEFORMATIVE
    FORMATIVELY
    INFORMATICS
    INFORMATION
    INFORMATIVE
    INFORMATORY
    CONFORMATION
    MALFORMATION
    DEFORMATIONAL
    INFORMATIONAL
    INFORMATIVELY
    UNINFORMATIVE

    Printing the tree structure in the format desired by the OP is left as
    an exercise to the reader ;-).

    hp
     
    Peter J. Holzer, Aug 7, 2008
    #4
  5. On 2008-08-06 16:23, Ted Zlatanov <> wrote:
    > On Tue, 5 Aug 2008 14:51:27 -0700 (PDT) wrote:
    > s> Given: a list of words sorted first in length order, then
    > s> alphabetically:

    [...]
    > s> Desired output: the same set of words showing nested substrings
    > s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
    > s> CONFORMATION MALFORMATION]
    > s> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
    > s> PERFORMATIVE UNINFORMATIVE]
    > s> FORMATTER
    > s> INFORMATICS
    > s> INFORMATORY
    >
    > You probably want a trie.


    Isn't a trie defined as a tree where the key of the parent node is a
    prefix of the keys of its children? Here it's an infix.

    hp
     
    Peter J. Holzer, Aug 7, 2008
    #5
  6. Ted Zlatanov Guest

    On Thu, 7 Aug 2008 11:54:23 +0200 "Peter J. Holzer" <> wrote:

    PJH> On 2008-08-06 16:23, Ted Zlatanov <> wrote:
    >> On Tue, 5 Aug 2008 14:51:27 -0700 (PDT) wrote:

    s> Given: a list of words sorted first in length order, then
    s> alphabetically:
    PJH> [...]
    s> Desired output: the same set of words showing nested substrings
    s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]
    s> CONFORMATION MALFORMATION]
    s> FORMATIVE [DEFORMATIVE FORMATIVELY [INFORMATIVELY] INFORMATIVE
    s> PERFORMATIVE UNINFORMATIVE]
    s> FORMATTER
    s> INFORMATICS
    s> INFORMATORY
    >>
    >> You probably want a trie.


    PJH> Isn't a trie defined as a tree where the key of the parent node is a
    PJH> prefix of the keys of its children? Here it's an infix.

    The OP said it was OK to have just prefix matching, though infix
    matching would be nice.

    You can also just iterate over the substrings of the target, e.g. for
    INFORMATION you check for I-N-F-O-R-M-A-T-I-O-N and then
    N-F-O-R-M-A-T-I-O-N and then F-O-R-M-A-T-I-O-N, etc. in the trie.

    Ted
     
    Ted Zlatanov, Aug 7, 2008
    #6
  7. On 2008-08-07 12:59, Ted Zlatanov <> wrote:
    > On Thu, 7 Aug 2008 11:54:23 +0200 "Peter J. Holzer" <> wrote:
    >
    > PJH> On 2008-08-06 16:23, Ted Zlatanov <> wrote:
    >>> On Tue, 5 Aug 2008 14:51:27 -0700 (PDT) wrote:

    > s> Given: a list of words sorted first in length order, then
    > s> alphabetically:
    > PJH> [...]
    > s> Desired output: the same set of words showing nested substrings
    > s> FORMATION [DEFORMATION [DEFORMATIONAL] INFORMATION [INFORMATIONAL]

    [...]
    >>>
    >>> You probably want a trie.

    >
    > PJH> Isn't a trie defined as a tree where the key of the parent node is a
    > PJH> prefix of the keys of its children? Here it's an infix.
    >
    > The OP said it was OK to have just prefix matching, though infix
    > matching would be nice.


    You mean "as a compromise, sacrificing the subnesting would be fine"? I
    understood that to mean that flattening the tree to just two levels
    would be acceptable, not that prefix matching would be ok.


    > You can also just iterate over the substrings of the target, e.g. for
    > INFORMATION you check for I-N-F-O-R-M-A-T-I-O-N and then
    > N-F-O-R-M-A-T-I-O-N and then F-O-R-M-A-T-I-O-N, etc. in the trie.


    It would be quite inefficient, though, since you have to search the trie
    for a lot of substrings. I don't see a reason to do that, given that the
    correct solution is so easy (see my answer to Ben).

    hp
     
    Peter J. Holzer, Aug 7, 2008
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Russ Perry Jr
    Replies:
    2
    Views:
    4,242
    Russ Perry Jr
    Aug 20, 2004
  2. Ben

    Strings, Strings and Damned Strings

    Ben, Jun 22, 2006, in forum: C Programming
    Replies:
    14
    Views:
    801
    Malcolm
    Jun 24, 2006
  3. Chad E. Dollins
    Replies:
    3
    Views:
    674
    Kai-Uwe Bux
    Nov 8, 2005
  4. request@no_spam.com
    Replies:
    5
    Views:
    453
  5. Ultrus
    Replies:
    3
    Views:
    399
    Stefan Behnel
    Jul 9, 2007
Loading...

Share This Page