seeking advice on problem difficulty

Discussion in 'Perl Misc' started by Rainer Weikusat, Aug 4, 2011.

  1. "ela" <> writes:
    > I've been working on this problem for 4 days and still cannot come up with
    > a good solution, and would appreciate it if you could comment on the problem.
    >
    > Given a table containing cells delimited by tab like this


    [ please see original for the indeed gory details ]

    Provided I understood the problem correctly, a possible solution could
    look like this (this code has had very little testing): First, you
    define your groups by associating array references containing the group
    members with the 'group ID' with the help of a hash:

    $grp{1} = [1, 2];

    Then, you create a hash mapping the column name to the column value
    for each ID and put these hashes into an id hash associated with the
    ID:

    $id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
    $id{2} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };

    Provided this has been done, a 'consistency' routine can be defined as

    sub consistency($$)
    {
        my ($grp, $col) = @_;
        my %seen;

        $seen{$_} = 1
            for (map { $id{$_}{$col} } @{$grp{$grp}});

        return 1.0 / keys(%seen);
    }

    This takes a group ID and a column name as argument and returns the
    'consistency' of this column for this group. Then, an array needs to
    be created which names the columns in the order they are supposed to
    be checked in:

    @order = qw(F3 F2 F1);

    Now, a 'decide' routine can be defined like this:

    sub decide($)
    {
        my $grp = $_[0];

        consistency($grp, $_) >= THRESHOLD and return $_
            for (@order);

        return undef;
    }

    This takes a group ID as argument and returns either the name of the
    first column (checked in the order given by @order) whose consistency
    is >= the THRESHOLD, or undef (:= inconsistent). As a complete script:

    ------------------
    #!/usr/bin/perl

    use constant THRESHOLD => 0.7;

    my (%grp, %id, @order, $res);

    @order = qw(F3 F2 F1);

    $grp{1} = [1, 2];

    $id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
    $id{2} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };

    sub consistency($$)
    {
        my ($grp, $col) = @_;
        my %seen;

        $seen{$_} = 1
            for (map { $id{$_}{$col} } @{$grp{$grp}});

        return 1.0 / keys(%seen);
    }

    sub decide($)
    {
        my $grp = $_[0];

        consistency($grp, $_) >= THRESHOLD and return $_
            for (@order);

        return undef;
    }

    $res = decide(1);
    $res //= 'inconsistent';

    print("$res\n");
    --------------------

    NB: This might not be a very sensible algorithm (ie, "I don't know
    this").
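    To connect this to the original tab-delimited input, here is a sketch of
    how %grp and %id might be populated from the two files the original post
    describes. The sub names (read_table, read_groups) and the exact file
    formats are my assumptions, not part of the posted code:

```perl
#!/usr/bin/perl
# Sketch only: building %grp and %id from the two inputs described in
# the original post. Sub names and formats are assumptions.
use strict;
use warnings;

my (%grp, %id);

# Table: a tab-delimited header ("ID\tF1\tF2..."), then one row per line.
# Assumes unique IDs, as in the post above.
sub read_table {
    my ($fh) = @_;
    chomp(my $header = <$fh>);
    my ($id_name, @col_names) = split /\t/, $header;
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, @cells) = split /\t/, $line;
        my %row;
        @row{@col_names} = @cells;   # column name => column value
        $id{$id} = \%row;
    }
}

# Groups: lines like "Group 1:1,2, 16, 200".
sub read_groups {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        $line =~ /^Group\s*(\d+)\s*:\s*(.*)/ or next;
        $grp{$1} = [ split /\s*,\s*/, $2 ];
    }
}
```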
     
    Rainer Weikusat, Aug 4, 2011
    #1

  3. "ela" <> writes:

    > I've been working on this problem for 4 days and still cannot come up with
    > a good solution, and would appreciate it if you could comment on the problem.
    >
    > Given a table containing cells delimited by tab like this (the 1st line
    > being the header and below I use delimiter space for clarity):
    >
    > ID F1 F2 F3 F4 F5
    > 1 SuperC3 C1 subC4 dummy hi
    > 1 SuperC3 C1 subC3 dumdum hell
    > 2 SuperC3 C1 subC3 hello hello
    > 3 SuperC3 C2 subC7 hel hel


    You have a duplicate ID. If this is not a typo, please ignore the rest
    of the post -- I've assumed that IDs are unique.

    > ...
    > 1000 SuperC1 C8 subC10 hi hi
    >
    >
    > and I have another table that groups the IDs together, e.g.
    > Group 1:1,2, 16, 200
    > Group 2:99, 136, 555
    > ...
    > Group 15: 123, 124, 999
    >
    >
    > The two tables above can contain non-overlapping entries, though most of
    > the time the entries do overlap. So the task is to make use of the 2nd table
    > and to look up the first table for evaluation. By checking group 1, we know
    > that IDs 1, 2, 16 and 200 are in the same group. And what to evaluate? Say,
    > if IDs 16 and 200 also contain subC3 in F3, then we can assign this group's
    > members to be subC3. If the consistency does not exceed a certain threshold
    > (e.g. 0.7), we then look up F2, check the statistics again, and may conclude
    > C1, and finally likewise for F1 (SuperC3). If the F1 consistency also cannot
    > pass the threshold, "inconsistent" is returned. The main problem here is: I
    > have been overwhelmed by hash of hash of hash!


    I would not use hashes for this data. It is naturally an array of
    arrays. Because you need to scan a column at selected row positions,
    I'd make it an array of columns:

    # after skipping the first line...
    my @column = ();
    while (<>) {
        my $c = 0;
        push(@{$column[$c++]}, $_) foreach split;
    }

    Now you can use a slice to extract the data for a particular column.
    Having got your group list as an array:

    my @group = map {$_ - 1} split /,\s*/, $group_string;

    (the -1 is to adjust for zero based indexing) you can write

    @{$column[0]}[@group]

    to get the items from the first column whose frequencies you are
    interested in.
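    To make the slice concrete, here is a tiny runnable example; the column
    contents and group string are invented for illustration:

```perl
use strict;
use warnings;

# Ben's array-of-columns layout, with invented values.
my @column = (
    [ 1, 2, 3, 4 ],                              # column 0: IDs
    [ qw(SuperC3 SuperC3 SuperC3 SuperC1) ],     # column 1: F1
);

# A group string as in the second table, adjusted to zero-based rows.
my $group_string = "1, 3, 4";
my @group = map { $_ - 1 } split /,\s*/, $group_string;

# Slice: the column-1 values for just the rows in the group.
my @f1 = @{$column[1]}[@group];
```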

    <snip code>
    > for $cid (keys %chash) {
    > if (exists $group{$cid}) {
    > for $f1 (keys %{%chash{$cid}{F1} {
    > for $gid (keys %($vhash{$group{$cid}}) {
    > ....


    Functions are crucial to managing complexity. I'd want a function
    'most_frequent' that can take an array of values and find the frequency
    of the most common value among them. It could return both that value
    and the frequency. Something like:

    sub most_frequent
    {
        my ($most_freq, %count) = ('', '' => 0);
        for my $item (@_) {
            $most_freq = $item if ++$count{$item} > $count{$most_freq};
        }
        return ($most_freq, $count{$most_freq}/@_);
    }

    This is passed a slice for a given column and group:

    most_frequent(@{$column[$col]}[@group])

    With that function you simply need to find the first $col whose most
    frequent item meets your threshold. 'First' may not be correct, since
    in your example you consider F3 before F1. Maybe you need the item from
    any column that has the greatest frequency? Maybe there is a specified
    order in which the columns should be tested? Whatever the answer, it
    should be easy with the above function.
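    A quick check of most_frequent's two return values, re-running the sub
    exactly as posted on invented data:

```perl
use strict;
use warnings;

# most_frequent exactly as posted above.
sub most_frequent
{
    my ($most_freq, %count) = ('', '' => 0);
    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    return ($most_freq, $count{$most_freq}/@_);
}

# Two of three values agree, so the frequency is 2/3.
my ($item, $freq) = most_frequent(qw(subC3 subC3 subC4));
```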

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Aug 4, 2011
    #3
  4. Ben Bacarisse <> writes:

    [...]


    > my @column = ();


    There is little reason to empty an empty array by assigning an empty
    list to it.

    [...]

    > sub most_frequent
    > {
    > my ($most_freq, %count) = ('', '' => 0);


    A more sensible initialization would be

    my ($most_freq, %count);

    $most_freq = shift;
    $count{$most_freq} = 1;

    > for my $item (@_) {
    > $most_freq = $item if ++$count{$item} > $count{$most_freq};
    > }
    > return ($most_freq, $count{$most_freq}/@_);


    and then use (@_ + 1) for the division.
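    Folding that suggestion in, the whole sub would read something like the
    following (my assembly of the fragments above, not code posted in one
    piece):

```perl
use strict;
use warnings;

# most_frequent seeded from the first data item, per the suggestion
# above; the divisor is (@_ + 1) because shift consumed one element.
sub most_frequent
{
    my ($most_freq, %count);

    $most_freq = shift;
    $count{$most_freq} = 1;

    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    return ($most_freq, $count{$most_freq} / (@_ + 1));
}
```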
     
    Rainer Weikusat, Aug 4, 2011
    #4
  5. "ela" <> writes:
    > "Rainer Weikusat" <> wrote in message
    > news:...
    >> Then, you create a hash mapping the column name to the column value
    >> for each ID and put these hashes into an id hash associated with the
    >> ID:
    >>
    >> $id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };

    >
    > Well, the duplicate ID curse appears here... so maybe an array has to be
    > used instead?
    >
    > $id{1}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
    > $id{1}[1] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC5' };
    > $id{2}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };
    >
    >> $seen{$_} = 1 for (map { $id{$_}{$col} } @{$grp{$grp}});
    >> return 1.0 / keys(%seen);
    >> This takes a group ID and a column name as argument and returns the
    >> 'consistency' of this column for this group.

    >
    > In fact, I don't quite understand what this code is doing. What does
    > $grp{$grp} refer to?


    Using the same name twice was arguably somewhat confusing: The inner
    $grp refers to the scalar variable $grp (defined in the function) that
    holds the group ID and it is used to select the value (array of IDs)
    associated with the corresponding key in the %grp hash.

    > And as it's no longer $id{$_} but $id{}[], how should it
    > be adjusted?


    Provided that all different 'column lists' associated with the same ID
    can be treated as if they were associated with different IDs, you
    could turn the map expression into something like

    map { map { $_->{$col} } @{$id{$_}}; } @{$grp{$grp}}

    (uncompiled).
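    Compiled and run against a tiny duplicate-ID data set (the data and the
    variable names $grpid and $col are mine), the expression behaves like
    this:

```perl
use strict;
use warnings;

# Each ID now maps to an *array* of column hashes, one per table row.
my %id = (
    1 => [ { F3 => 'subC4' }, { F3 => 'subC5' } ],   # ID 1 occurs twice
    2 => [ { F3 => 'subC3' } ],
);
my %grp = ( 1 => [ 1, 2 ] );

my ($grpid, $col) = (1, 'F3');

# Outer map walks the IDs in the group; inner map walks each ID's rows.
my @values = map { map { $_->{$col} } @{$id{$_}} } @{$grp{$grpid}};
```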

    >> my $grp = $_[0];

    > Why do you make it an array element?


    Make what? This line defines a scalar variable named $grp and assigns
    the first argument of the sub to it: all arguments are passed in the
    array @_, hence $_[0] is the first argument.
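    A one-line illustration (the sub name is invented):

```perl
use strict;
use warnings;

# @_ holds all arguments passed to the sub; $_[0] is the first of them.
sub first_arg { my $grp = $_[0]; return $grp }

my $got = first_arg(42, 'ignored');
```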

    >> consistency($grp, $_) >= THRESHOLD and return $_ for (@order);

    > the exact item (here, say, C1, in addition to F2) should also be
    > returned.... (Well, that's why I said the problem is complicated...)


    Then, you'll need something similar to Ben's most_frequent routine.

    >> $res //= 'inconsistent';

    > Is this a typo? What's this?


    // is similar to || except that it tests for 'definedness' and not
    'trueness': It will assign the string to the variable if its value is
    undef.
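    A short demonstration of the difference (note that // and //= require
    perl 5.10 or later):

```perl
use strict;
use warnings;
use 5.010;   # the defined-or operators // and //= need perl 5.10+

my $num = 0;    # defined, but false
my $res;        # undef

my $x = $num // 'fallback';   # 0 is defined, so // keeps the 0
my $y = $num || 'fallback';   # 0 is false, so || falls through
$res //= 'inconsistent';      # assigns only because $res was undef
```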
     
    Rainer Weikusat, Aug 4, 2011
    #5
  6. >>>>> "ela" == ela <> writes:

    ela> but this error "Search pattern not terminated at test.pl"
    ela> occurs...

    Upgrade your Perl to 5.10 or later.

    --
    Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
    <> <URL:http://www.stonehenge.com/merlyn/>
    Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
    See http://methodsandmessages.posterous.com/ for Smalltalk discussion
     
    Randal L. Schwartz, Aug 4, 2011
    #6
  7. "ela" <> writes:

    > "Ben Bacarisse" <> wrote in message
    > news:...
    >> my ($most_freq, %count) = ('', '' => 0);

    > I guess the above line is doing some sort of initialization of zero's, but
    > why is ", " used? Making both $most_freq and %count separated by comma to be
    > zero?


    It's equivalent to

    my $most_freq = '';
    my %count;
    $count{$most_freq} = 0;

    but that's not a very good choice. I should have done what one does
    with functions like max and min and set $most_freq to the first data
    item as Rainer suggested.

    >> for my $item (@_) {

    > How does this @_ correspond to @{$column[$col]}[@group] below?


    That's how subroutines and argument passing work in Perl. I don't know
    what else to say. It's too long since I read a Perl text so if I try to
    explain it I'll probably use the wrong terms but let me try...
    Subroutine arguments are evaluated in list context so any arrays you
    write there get joined together (along with non-array arguments) into
    one big argument array. Inside the subroutine, the name for this array
    is @_. In this case there is only one argument: the list that results
    from slicing the array @{$column[$col]} with the @group array.
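    The flattening can be seen by counting @_ (toy data, mine):

```perl
use strict;
use warnings;

# Whatever list you pass, the sub sees one flat argument array, @_.
sub count_args { return scalar @_ }

my @column = ( [ qw(a b c d) ] );
my @group  = ( 0, 2 );

my $n = count_args( @{ $column[0] }[@group] );   # the slice yields 2 items
my $m = count_args( 'x', @group );               # scalar + 2-element array
```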

    Did you catch the bit where I said to ignore what I wrote if the
    duplicate IDs were not a typo? I really meant it. I don't think
    slicing a column array is the right way to go with the data you have.
    You might still find most_frequent a useful function, but that's about
    it.

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Aug 4, 2011
    #7
    ela Guest

    Dear all,

    I've been working on this problem for 4 days and still cannot come up with
    a good solution, and would appreciate it if you could comment on the problem.

    Given a table containing cells delimited by tab like this (the 1st line
    being the header and below I use delimiter space for clarity):

    ID F1 F2 F3 F4 F5
    1 SuperC3 C1 subC4 dummy hi
    1 SuperC3 C1 subC3 dumdum hell
    2 SuperC3 C1 subC3 hello hello
    3 SuperC3 C2 subC7 hel hel
    ....
    1000 SuperC1 C8 subC10 hi hi


    and I have another table that groups the IDs together, e.g.
    Group 1:1,2, 16, 200
    Group 2:99, 136, 555
    ....
    Group 15: 123, 124, 999


    The two tables above can contain non-overlapping entries, though most of
    the time the entries do overlap. So the task is to make use of the 2nd table
    and to look up the first table for evaluation. By checking group 1, we know
    that IDs 1, 2, 16 and 200 are in the same group. And what to evaluate? Say,
    if IDs 16 and 200 also contain subC3 in F3, then we can assign this group's
    members to be subC3. If the consistency does not exceed a certain threshold
    (e.g. 0.7), we then look up F2, check the statistics again, and may conclude
    C1, and finally likewise for F1 (SuperC3). If the F1 consistency also cannot
    pass the threshold, "inconsistent" is returned. The main problem here is: I
    have been overwhelmed by hash of hash of hash!

    First, I use a hash to bin the groups with the ID's, e.g.
    $vhash{$group}{$ID} = 1;
    and memorize which $ID belongs to which $group
    $group{$ID} = $groupID;


    Then I use another hash to bin the other table, e.g. ($cells[] refers to the
    parsing result while reading the table in a file)

    $chash{$ID}{F1}{$cells[$F1_location]} = 1;
    $chash{$ID}{F2}{$cells[$F2_location]} = 1;
    $chash{$ID}{F3}{$cells[$F3_location]} = 1;

    And the complicated thing appears here....

    for $cid (keys %chash) {
    if (exists $group{$cid}) {
    for $f1 (keys %{%chash{$cid}{F1} {
    for $gid (keys %($vhash{$group{$cid}}) {
    .....

    Okay, I know it's difficult to read... and that's why I'm seeking advice
    here as the complicated stuff only just starts there... Is there any
    convenient data structure that helps better manage this kind of
    "cross/circulate-referencing"-like problem?
     
    ela, Aug 5, 2011
    #8
    ela Guest

    Thanks to both Rainer and Ben. The 1st table does contain duplicate IDs
    but not duplicate rows, because one ID can contain several F3's, so
    (ID, F3) forms a primary key.

    Now I have learned more about how to tackle this problem from you and
    shall benefit from reusing these skills in future. Thanks a lot!!
     
    ela, Aug 5, 2011
    #9
    ela Guest

    "Rainer Weikusat" <> wrote in message
    news:...
    > Then, you create a hash mapping the column name to the column value
    > for each ID and put these hashes into an id hash associated with the
    > ID:
    >
    > $id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };


    Well, the duplicate ID curse appears here... so maybe an array has to be
    used instead?

    $id{1}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
    $id{1}[1] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC5' };
    $id{2}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };

    > $seen{$_} = 1 for (map { $id{$_}{$col} } @{$grp{$grp}});
    > return 1.0 / keys(%seen);
    > This takes a group ID and a column name as argument and returns the
    > 'consistency' of this column for this group.


    In fact, I don't quite understand what this code is doing. What does
    $grp{$grp} refer to? And as it's no longer $id{$_} but $id{}[], how
    should it be adjusted?

    > my $grp = $_[0];

    Why do you make it an array element?

    > consistency($grp, $_) >= THRESHOLD and return $_ for (@order);

    the exact item (here, say, C1, in addition to F2) should also be
    returned.... (Well, that's why I said the problem is complicated...)

    > $res //= 'inconsistent';

    Is this a typo? What's this?
     
    ela, Aug 5, 2011
    #10
    ela Guest

    "Rainer Weikusat" <> wrote in message
    news:...
    > Provided that all different 'column lists' associated with the same ID
    > can be treated as if they were associated with different IDs, you
    > could turn the map expression into something like
    >
    > map { map { $_->{$col} } @{$id{$_}}; } @{$grp{$grp}}


    "map" seems to be doing the "cross-referencing" task that I have to
    accomplish. I only learnt about more complicated data structures a few
    months ago and would appreciate it if you could explain what is happening
    here. So when an element from %grp is extracted, it becomes the second
    "$_" in the outer map? And then the inner map extracts an array element
    from %id, and this element becomes the first "$_"? Finally, is there a
    quick way to find the intersection between the %grp and %id hashes? I
    want to learn your elegant one-line map code, but it cannot process the
    non-overlapping IDs, so I have to process them the other way.

    > // is similar to || except that it tests for 'definedness' and not
    > 'trueness': It will assign the string to the variable if its value is
    > undef.


    but this error "Search pattern not terminated at test.pl" occurs...
     
    ela, Aug 5, 2011
    #11
    ela Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > my ($most_freq, %count) = ('', '' => 0);

    I guess the above line is doing some sort of initialization of zero's, but
    why is ", " used? Making both $most_freq and %count separated by comma to be
    zero?

    > for my $item (@_) {

    How does this @_ correspond to @{$column[$col]}[@group] below?

    > $most_freq = $item if ++$count{$item} > $count{$most_freq};
    > }
    > return ($most_freq, $count{$most_freq}/@_);
    > }
    >
    > This is passed a slice for a given column and group:
    >
    > most_frequent(@{$column[$col]}[@group])
     
    ela, Aug 5, 2011
    #12
  13. "ela" <> wrote:
    >"Ben Bacarisse" <> wrote in message
    >news:...
    >> Functions are crucial to managing complexity. I'd want a function
    >> 'most_frequent' that can take an array of values and find the frequency
    >> of the most common value among them. It could return both that value
    >> and the frequency. Something like:

    >
    >I'd appreciate it if I could learn more from you about your thinking
    >philosophy. As said previously, I only thought of a lot of "if"'s and
    >"hash"'s and was never able to use functions to wrap up some of the
    >concepts. Would you mind telling me what cues trigger you to think
    >about using a function?


    That is actually a good question. And while there are many factors I
    don't think there are any hard rules although learning how to split a
    problem into smaller parts is the most crucial skill in software
    engineering.

    Some indicators for when to split code into smaller units and/or create
    a function:
    - abstract data types: when you design an abstract data type, then there
    are standard operations which you need with any data type like e.g.
    creating, initializing, and deleting elements and modifying values. For
    example, for any 'list of xyz' you need an empty list, append to a
    list, access an element of a list, very likely concatenate two lists,
    apply a function to each element of a list, .... So just go ahead and
    write a function for each; sooner or later you will need it.
    - application domain: often the application domain already provides for
    a set of functions, e.g. if I were to write a module for statistics
    obviously I need functions for mean, average, standard deviation, maybe
    min and max, and so forth.
    - code reuse: if I find myself typing the same code or very similar code
    multiple times, then check if it can be broken out into a function,
    possibly with parameters to account for the minor differences between
    almost the same code.
    - code complexity: if there are more than 2 to at most 3 levels of
    indentation (nested loops and if's) then usually that is an indication
    that part of the code should probably be split off into a function.
    - code complexity: an odd rule of thumb which nevertheless I still find
    a very useful guideline: a VT220 had 24 lines of text. If your function
    grew longer than about 2/3 of the screen (i.e. more than ~16 lines) it
    was time to consider breaking it up. And if it didn't fit on a single
    screen any more, then it was definitely time to check why this function
    was so long.
    - algorithmic considerations: whenever the algorithm is asking for
    recursion then obviously you should create a function.
    - code complexity: whenever you are working with higher-order functions,
    consider naming the arguments, i.e. create a named function for
    wanted() in File::Find or the filter function in grep() and so on. Of
    course for simple cases you can use anonymous functions, but whenever it
    is a non-trivial function it helps to name it.
    - algorithmic consideration: sometimes loops are better written as
    recursions. Obviously in such cases you will need a function.
    - algorithmic consideration: return values. Whenever I find myself
    thinking 'and now all I need is to compute foobar from these 4 values'
    then this is a strong indication that I should write a function foobar()
    that takes 4 arguments and returns the desired value.

    I am sure there are many more indications. But I am also sure that part
    of it is personal preference. And most of all it is a matter of
    experience. After all, software engineering is still one of the major
    fundamentals of computer science, and functions are its smallest
    building blocks.

    jue
     
    Jürgen Exner, Aug 5, 2011
    #13
    Justin C Guest

    Re: Sorting hash %seen

    On 2011-08-06, ela <> wrote:
    >
    > "Rainer Weikusat" <> wrote in message
    >
    >> Then, you'll need something similar to Ben's most_frequent routine

    >
    > I managed to take frequency by:
    >
    > $seen{$_}++ for (map { $id{$_}{$col} } @{$grp{$grpid}});


    I've not tried it (and, TBH, I have trouble getting my head round it,
    map is still new to me) but that doesn't look right to me.

    AIUI, the $_ in $seen{$_}++ takes whatever is in $_ before you do the
    'map'; the $_ in the map function is, I believe, local to the map
    function and therefore not available outside the function.


    Justin.

    --
    Justin C, by the sea.
     
    Justin C, Aug 5, 2011
    #14
  15. "ela" <> writes:

    > "Ben Bacarisse" <> wrote in message
    > news:...
    >> Functions are crucial to managing complexity. I'd want a function
    >> 'most_frequent' that can take an array of values and find the frequency
    >> of the most common value among them. It could return both that value
    >> and the frequency. Something like:

    >
    > I'd appreciate it if I could learn more from you about your thinking
    > philosophy. As said previously, I only thought of a lot of "if"'s and
    > "hash"'s and was never able to use functions to wrap up some of the
    > concepts. Would you mind telling me what cues trigger you to think
    > about using a function?


    There's no easy answer to that. You must try to get into the habit of
    dreaming. You think: what function, if it were available, would make
    the job easier? The academic answer is that you think "top down" -- you
    imagine the very highest level of the program before you have any idea
    how to write it:

    table = read_table();
    for each group:
        print classify(group, table);

    You know that classify will need both the group list and the full table
    to do its job so you pass these as parameters. Then you break down
    classify:

    classify(group, table):
        for each column in table
            top_item = most_common_item_in(column, table);
            freq = freq_of(top_item, column, table);
            if freq > threshold
                return top_item
        return 'inconsistent'

    Here you go "ah, classify needs to know the threshold" so you revise the
    parameter list.

    When you write most_common_item_in and freq_of you will find that they
    do very similar things, and you may decide to combine them. That's what
    I did.

    The trouble with this plan (and why I say this is the academic answer)
    is that this breakdown interacts at all stages with the design of the
    data structures that you will use. The result is, in practice, a lot
    more going back and forth between different ideas. When you find
    something is getting too messy to write, you should re-think your data
    structures. That takes a lot of experience though. What is too messy?
    maybe this is a messy problem and there is no neat solution?

    I had exactly this problem. I told you that you probably don't want an
    array of columns because I'd missed the duplicate ID problem. The
    trouble was that I did not go back and revise my design. If I had, I'd
    have seen that a simple change to the data structure and the way it is
    input is all this is needed. If we store arrays of items at each
    position it is simple to extract and "flatten" these before counting the
    frequencies. I.e. reading the data becomes

    my @column;
    while (<>) {
        chomp;
        my @row = split;
        my $c = 0;
        push @{$column[$c++]->[$row[0]]}, $_ foreach @row;
    }

    The slice @{$column[$col]}[@some_array_of_rows] is now an array of
    array references, so we need to flatten it. A function to do that is

    sub flatten { map {@$_} @_ }

    The classify function might then be

    sub classify
    {
        my ($group, $table, $threshold) = @_;
        for my $col (1 .. $#$table) {
            my ($item, $freq) =
                most_frequent(flatten(@{$table->[$col]}[@$group]));
            return $item if $freq >= $threshold;
        }
        return 'inconsistent';
    }

    I start at 1 because I have retained the (possible multiple) IDs in the
    table. It makes sense to retain them because it is logically possible
    to classify by the first columns as much as it is by any other even
    though you don't do this in your case.

    Making everything a parameter to classify is obviously the right thing
    to do but it does make the key line rather fussy. Still, three short
    functions and some input code is all it took in the end.
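    Put together and run on a toy duplicate-ID table, the pieces look like
    this (my assembly; the data is invented, and most_frequent uses the
    first-item initialization suggested earlier in the thread):

```perl
use strict;
use warnings;

# most_frequent seeded from the first item (Rainer's variant).
sub most_frequent {
    my ($most_freq, %count);
    $most_freq = shift;
    $count{$most_freq} = 1;
    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    return ($most_freq, $count{$most_freq} / (@_ + 1));
}

sub flatten { map { @$_ } @_ }

sub classify {
    my ($group, $table, $threshold) = @_;
    for my $col (1 .. $#$table) {
        my ($item, $freq) =
            most_frequent(flatten(@{ $table->[$col] }[@$group]));
        return $item if $freq >= $threshold;
    }
    return 'inconsistent';
}

# $column[$col][$id] holds an array of that ID's values in that column.
my @column;
$column[1][1] = [ 'SuperC3', 'SuperC3' ];   # ID 1 has two rows
$column[1][2] = [ 'SuperC3' ];
$column[2][1] = [ 'subC4', 'subC3' ];
$column[2][2] = [ 'subC3' ];

# All of group [1, 2] agrees on column 1, so classify stops there.
my $res = classify([1, 2], \@column, 0.7);
```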

    --
    Ben.
     
    Ben Bacarisse, Aug 5, 2011
    #15
  16. Re: Sorting hash %seen

    Justin C <> writes:
    > On 2011-08-06, ela <> wrote:
    >>
    >> "Rainer Weikusat" <> wrote in message
    >>
    >>> Then, you'll need something similar to Ben's most_frequent routine

    >>
    >> I managed to take frequency by:
    >>
    >> $seen{$_}++ for (map { $id{$_}{$col} } @{$grp{$grpid}});

    >
    > I've not tried it (and, TBH, I have trouble getting my head round it,
    > map is still new to me) but that doesn't look right to me.
    >
    > AIUI, the $_ in $seen{$_}++ takes whatever is in $_ before you do the
    > 'map', the $_ in the map function is, I believe, local to the map
    > function and therefore not available outside the function.


    Yes. But the for/foreach loop which processes the 'output' of the map
    expression successively binds $_ to each of the values returned by it.
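    Both spellings can be run side by side (toy values, mine) to confirm
    they build the same counts:

```perl
use strict;
use warnings;

my @raw = qw(subC3 subC3 subC4);

# Statement-modifier form: $_ is bound to each value the map returns.
my %seen_a;
$seen_a{$_}++ for (map { $_ } @raw);

# Equivalent block form of the same loop.
my %seen_b;
foreach (map { $_ } @raw) {
    $seen_b{$_}++;
}
```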
     
    Rainer Weikusat, Aug 5, 2011
    #16
  17. Re: Sorting hash %seen

    Tad McClellan <> writes:
    > Justin C <> wrote:
    >> On 2011-08-06, ela <> wrote:
    >>> "Rainer Weikusat" <> wrote in message
    >>>
    >>>> Then, you'll need something similar to Ben's most_frequent routine
    >>>
    >>> I managed to take frequency by:
    >>>
    >>> $seen{$_}++ for (map { $id{$_}{$col} } @{$grp{$grpid}});

    >>
    >> I've not tried it (and, TBH, I have trouble getting my head round it,
    >> map is still new to me) but that doesn't look right to me.
    >>
    >> AIUI, the $_ in $seen{$_}++ takes whatever is in $_ before you do the
    >> 'map',

    >
    > No, it takes on the values returned from the map.
    >
    > Rewrite it, and it may become more clear:
    >
    > foreach (map { $id{$_}{$col} } @{$grp{$grpid}}) {
    > $seen{$_}++
    > }


    Someone who is familiar with the semantics of a loop executing code
    contained in a block, but not with the semantics of the equivalent
    statement modifiers, will understand the 'code with the block' and
    won't understand the 'code with the statement modifier'. Neither
    construct has any meaning that someone completely without knowledge of
    its defined semantics could understand, and both constructs necessarily
    have clearly defined semantics because otherwise a computer couldn't
    execute them. IOW, this is not a problem of one variant vs the other
    but of availability or lack of knowledge on the part of the person
    trying to understand the code.
     
    Rainer Weikusat, Aug 5, 2011
    #17
  18. Re: Choice of data structure

    "ela" <> writes:
    > "Rainer Weikusat" <> wrote in message
    > news:...
    >> "ela" <> writes:
    >>> I've been working on this problem for 4 days and still cannot come up
    >>> with a good solution, and would appreciate it if you could comment on
    >>> the problem.
    >>>
    >>> Given a table containing cells delimited by tab like this

    >>
    >> [ please see original for the indeed gory details ]
    >>
    >> Provided I understood the problem correctly, a possible solution could
    >> look like this (this code has had very little testing): First, you
    >> define your groups by associating array references containing the group
    >> members with the 'group ID' with the help of a hash:
    >>
    >> $grp{1} = [1, 2];
    >>
    >> Then, you create a hash mapping the column name to the column value
    >> for each ID and put these hashes into an id hash associated with the
    >> ID:
    >>
    >> $id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
    >> $id{2} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };

    >
    > While I'm revising the code, I find that I over-rely on hashes, and
    > that complicates my problem. What made you decide to use an array for
    > "group" but a hash for "id"?


    'Groups' didn't have named columns.
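
    To spell out the distinction (the same example data as quoted above):
    a group is just an ordered list of member IDs, so an array reference
    is enough; an ID's fields are looked up by column name, which is
    exactly what a hash reference is for.

    ```perl
    use strict;
    use warnings;

    # Groups have no named columns, only members: an array ref suffices.
    my %grp = ( 1 => [ 1, 2 ] );

    # Each ID maps column names to values: lookup by name wants a hash ref.
    my %id = (
        1 => { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' },
        2 => { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' },
    );

    # Walk the group's members (positional, no names) and fetch a
    # column by name (hash lookup).
    my @f3 = map { $id{$_}{F3} } @{ $grp{1} };
    # @f3 is now ('subC4', 'subC3')
    ```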
     
    Rainer Weikusat, Aug 5, 2011
    #18
  19. Ben Bacarisse <> writes:
    > "ela" <> writes:
    >> "Ben Bacarisse" <> wrote in message


    [...]

    >> I'd appreciate it if I could learn more from you about the thinking
    >> philosophy. As said previously, I only thought of a lot of "if"'s and
    >> "hash"'s and was never able to use functions to wrap up some of the
    >> concepts. Would you mind telling me which cues trigger you to think
    >> about using a function?

    >
    > There's no easy answer to that. You must try to get into the habit of
    > dreaming. You think: what function, if it were available, would make
    > the job easier? The academic answer is that you think "top down" -- you
    > imagine the very highest level of the program before you have any idea
    > how to write it:
    >
    > table = read_table();
    > for each group:
    >     print classify(group, table);
    >
    > You know that classify will need both the group list and the full table
    > to do its job so you pass these as parameters. Then you break down
    > classify:
    >
    > classify(group, table):
    >     for each column in table
    >         top_item = most_common_item_in(column, table);
    >         freq = freq_of(top_item, column, table);
    >         if freq > threshold
    >             return top_item
    >     return 'inconsistent'
    >
    > Here you go "ah, classify needs to know the threshold" so you revise the
    > parameter list.
    >
    > When you write most_common_item_in and freq_of you will find that
    > they do very similar things, and you may decide to combine them.
    > That's what I did.
    >
    > The trouble with this plan (and why I say this is the academic answer)
    > is that this breakdown interacts at all stages with the design of the
    > data structures that you will use.
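
    A rough Perl rendering of the pseudocode above (an untested sketch:
    it combines most_common_item_in and freq_of into one helper, as the
    post suggests, and assumes the %grp/%id layout defined earlier in
    the thread, passed in by reference):

    ```perl
    use strict;
    use warnings;

    use constant THRESHOLD => 0.7;

    # Most frequent value of column $col within group $gid, plus its
    # relative frequency; combines most_common_item_in and freq_of.
    sub top_item_and_freq {
        my ($gid, $col, $grp, $id) = @_;
        my %count;
        $count{ $id->{$_}{$col} }++ for @{ $grp->{$gid} };
        my ($top) = sort { $count{$b} <=> $count{$a} } keys %count;
        return ( $top, $count{$top} / @{ $grp->{$gid} } );
    }

    # Return the dominant value of the first column (in @$order) whose
    # top value reaches THRESHOLD, or 'inconsistent' if none does.
    sub classify {
        my ($gid, $order, $grp, $id) = @_;
        for my $col (@$order) {
            my ($top, $freq) = top_item_and_freq($gid, $col, $grp, $id);
            return $top if $freq >= THRESHOLD;
        }
        return 'inconsistent';
    }
    ```

    With the two-member group quoted earlier in the thread,
    classify(1, [qw(F3 F2 F1)], \%grp, \%id) skips F3 (two distinct
    values, frequency 0.5 each) and returns 'C1' from F2.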


    I assure you that this is not 'the academic answer' but a perfectly
    workable design methodology which has essentially been ignored ever
    since its invention in the last century, using whatever pretext seemed
    most suitable at the time. For as long as the intent is to create
    working and easily maintainable code in order to solve problems (as
    opposed to, say, "contribute your name to the Linux kernel changelog
    for the sake of it being in there"), 'stepwise refinement' is
    definitely worth trying instead of just assuming that it cannot
    possibly work and hence - thank god! - 'we' can continue with the
    time-honoured procedure of 'hacking away at it until it's all pieces'.
     
    Rainer Weikusat, Aug 5, 2011
    #19
  20. Rainer Weikusat

    Jim Gibson Guest

    Re: Sorting hash %seen

    In article <j1gtt0$gsk$>, ela
    <> wrote:

    > "Tad McClellan" <> wrote in message
    > news:...
    > > ela <> wrote:
    > >
    > >> my @sorted_keys = sort { $seen{$b} <=> $seen{$a}}

    > >
    > >
    > > You have left off the list of things to be sorted.

    >
    > Adding %seen solves the problem, though the warning "Use of
    > uninitialized value in numeric comparison (<=>) at test.pl" still
    > appears.
    >
    > Due to time limitations, I shall focus on working out a draft
    > solution first, and thank you very much for being patient in teaching
    > me netiquette and a lot of other stuff over the years.


    You don't want to sort %seen, as that has no meaning. You want to sort
    the keys of %seen:

    my @sorted_keys = sort { $seen{$b} <=> $seen{$a}} keys %seen;
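
    For instance (hypothetical counts, not from the thread):

    ```perl
    use strict;
    use warnings;

    # Hypothetical frequency counts.
    my %seen = ( subC1 => 7, subC3 => 5, subC4 => 2 );

    # Sort the keys by the values they map to, highest count first.
    my @sorted_keys = sort { $seen{$b} <=> $seen{$a} } keys %seen;
    # @sorted_keys is ('subC1', 'subC3', 'subC4')
    ```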

    --
    Jim Gibson
     
    Jim Gibson, Aug 5, 2011
    #20