seeking advice on problem difficulty

  • Thread starter Rainer Weikusat

Rainer Weikusat

ela said:
I've been working on this problem for 4 days and still cannot come up
with a good solution; I would appreciate it if you could comment on the
problem.

Given a table containing cells delimited by tabs, like this

[ please see original for the indeed gory details ]

Provided I understood the problem correctly, a possible solution could
look like this (this code has had very little testing): First, you
define your groups by associating array references containing the group
members with the 'group ID' with the help of a hash:

$grp{1} = [1, 2];

Then, you create a hash mapping the column name to the column value
for each ID and put these hashes into an id hash associated with the
ID:

$id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
$id{2} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };

Provided this has been done, a 'consistency' routine can be defined as

sub consistency($$)
{
    my ($grp, $col) = @_;
    my %seen;

    $seen{$_} = 1
        for (map { $id{$_}{$col} } @{$grp{$grp}});

    return 1.0 / keys(%seen);
}

This takes a group ID and a column name as argument and returns the
'consistency' of this column for this group. Then, an array needs to
be created which names the columns in the order they are supposed to
be checked in:

@order = qw(F3 F2 F1);

Now, a 'decide' routine can be defined like this:

sub decide($)
{
    my $grp = $_[0];

    consistency($grp, $_) >= THRESHOLD and return $_
        for (@order);

    return undef;
}

This takes a group ID as argument and returns either the name of the
first column (checked in the order given by @order) whose consistency
is >= the THRESHOLD, or undef (:= inconsistent). As a complete script:

------------------
#!/usr/bin/perl

use constant THRESHOLD => 0.7;

my (%grp, %id, @order, $res);

@order = qw(F3 F2 F1);

$grp{1} = [1, 2];

$id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
$id{2} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };

sub consistency($$)
{
    my ($grp, $col) = @_;
    my %seen;

    $seen{$_} = 1
        for (map { $id{$_}{$col} } @{$grp{$grp}});

    return 1.0 / keys(%seen);
}

sub decide($)
{
    my $grp = $_[0];

    consistency($grp, $_) >= THRESHOLD and return $_
        for (@order);

    return undef;
}

$res = decide(1);
$res //= 'inconsistent';

print("$res\n");
 

Ben Bacarisse

ela said:
I've been working on this problem for 4 days and still cannot come up
with a good solution; I would appreciate it if you could comment on the
problem.

Given a table containing cells delimited by tabs, like this (the 1st
line being the header; below I use spaces as the delimiter for
clarity):

ID F1 F2 F3 F4 F5
1 SuperC3 C1 subC4 dummy hi
1 SuperC3 C1 subC3 dumdum hell
2 SuperC3 C1 subC3 hello hello
3 SuperC3 C2 subC7 hel hel

You have a duplicate ID. If this is not a typo, please ignore the rest
of the post -- I've assumed that IDs are unique.
...
1000 SuperC1 C8 subC10 hi hi


and I have another table that groups the IDs together, e.g.
Group 1: 1, 2, 16, 200
Group 2: 99, 136, 555
...
Group 15: 123, 124, 999


The two tables above can contain non-overlapping entries, though most of
the time the entries do overlap. So the task is to make use of the 2nd
table and to look up the first table for evaluation. By checking group
1, we know that IDs 1, 2, 16 and 200 are in the same group. And what to
evaluate? Say, if IDs 16 and 200 also contain subC3 in F3, then we can
assign group 1's members to be subC3. If the consistency does not exceed
a certain threshold (e.g. 0.7), we then look up F2 and check the
statistics again and may conclude C1, and finally F1 (SuperC3). If the
F1 consistency also cannot pass the threshold, "inconsistent" is
returned. The main problem here is: I have been overwhelmed by hash of
hash of hash!

I would not use hashes for this data. It is naturally an array of
arrays. Because you need to scan a column at selected row positions,
I'd make it an array of columns:

# after skipping the first line...
my @column = ();
while (<>) {
    my $c = 0;
    push(@{$column[$c++]}, $_) foreach split;
}

Now you can use a slice to extract the data for a particular column.
Having got your group list as an array:

my @group = map {$_ - 1} split /,\s*/, $group_string;

(the -1 is to adjust for zero based indexing) you can write

@{$column[0]}[@group]

to get the items from the first column whose frequencies you are
interested in.
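A quick self-contained sketch of this slicing, with made-up sample rows
(the data here is invented for illustration):

```perl
use strict;
use warnings;

# Build an array of columns from tab-separated rows (sample data).
my @rows = (
    "1\tSuperC3\tC1",
    "2\tSuperC3\tC1",
    "3\tSuperC3\tC2",
);
my @column;
for (@rows) {
    my $c = 0;
    push(@{$column[$c++]}, $_) foreach split /\t/;
}

# Rows of interest (zero-based): group "1, 3" becomes (0, 2).
my @group = map { $_ - 1 } split /,\s*/, "1, 3";

# Slice column 2 (F2) at the group's row positions.
my @items = @{$column[2]}[@group];
print "@items\n";   # C1 C2
```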

for $cid (keys %chash) {
if (exists $group{$cid}) {
for $f1 (keys %{%chash{$cid}{F1} {
for $gid (keys %($vhash{$group{$cid}}) {
....

Functions are crucial to managing complexity. I'd want a function
'most_frequent' that can take an array of values and find the frequency
of the most common value among them. It could return both that value
and the frequency. Something like:

sub most_frequent
{
    my ($most_freq, %count) = ('', '' => 0);
    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    return ($most_freq, $count{$most_freq}/@_);
}

This is passed a slice for a given column and group:

most_frequent(@{$column[$col]}[@group])

With that function you simply need to find the first $col whose most
frequent item meets your threshold. 'First' may not be correct, since
in your example you consider F3 before F1. Maybe you need the item from
any column that has the greatest frequency? Maybe there is a specified
order in which the columns should be tested? Whatever the answer, it
should be easy with the above function.
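To see most_frequent in action, one can run it on a few made-up values
(the sub below is copied from above; the inputs are invented):

```perl
use strict;
use warnings;

sub most_frequent
{
    my ($most_freq, %count) = ('', '' => 0);
    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    # Return the winner and its relative frequency.
    return ($most_freq, $count{$most_freq}/@_);
}

my ($item, $freq) = most_frequent(qw(subC3 subC3 subC4 subC3));
print "$item $freq\n";   # subC3 0.75
```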

<snip>
 

Rainer Weikusat

[...]

my @column = ();

There is little reason to empty an empty array by assigning an empty
list to it.

[...]
sub most_frequent
{
    my ($most_freq, %count) = ('', '' => 0);

A more sensible initialization would be

my ($most_freq, %count);

$most_freq = shift;
$count{$most_freq} = 1;
for my $item (@_) {
    $most_freq = $item if ++$count{$item} > $count{$most_freq};
}
return ($most_freq, $count{$most_freq}/@_);

and then use (@_ + 1) for the division.
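Assembled, the revised routine would read as follows (this combined
version is my own assembly of the suggestion above, lightly checked on a
small example):

```perl
use strict;
use warnings;

sub most_frequent
{
    my ($most_freq, %count);

    # Seed the counter with the first value instead of a dummy ''.
    $most_freq = shift;
    $count{$most_freq} = 1;
    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    # @_ no longer contains the shifted first element, hence + 1.
    return ($most_freq, $count{$most_freq} / (@_ + 1));
}

my ($item, $freq) = most_frequent(qw(a b b b));
print "$item $freq\n";   # b 0.75
```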
 

Rainer Weikusat

ela said:
Rainer Weikusat said:
Then, you create a hash mapping the column name to the column value
for each ID and put these hashes into an id hash associated with the
ID:

$id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };

Well, the duplicate ID curse appears here... so maybe an array has to be
used instead?

$id{1}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
$id{1}[1] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC5' };
$id{2}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };
$seen{$_} = 1 for (map { $id{$_}{$col} } @{$grp{$grp}});
return 1.0 / keys(%seen);
This takes a group ID and a column name as argument and returns the
'consistency' of this column for this group.

In fact, I don't quite understand what this code is doing. What does
$grp{$grp} refer to?

Using the same name twice was arguably somewhat confusing: The inner
$grp refers to the scalar variable $grp (defined in the function) that
holds the group ID and it is used to select the value (array of IDs)
associated with the corresponding key in the %grp hash.
And as it's no longer $id{$_} but $id{}[], how should this be
adjusted?

Provided that all different 'column lists' associated with the same ID
can be treated as if they were associated with different IDs, you
could turn the map expression into something like

map { map { $_->{$col} } @{$id{$_}}; } @{$grp{$grp}}

(uncompiled).
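A self-contained check of that expression, using data of the shape
discussed above:

```perl
use strict;
use warnings;

my (%grp, %id);

$grp{1} = [1, 2];

# Multiple 'column lists' per ID, as in ela's follow-up.
$id{1}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
$id{1}[1] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC5' };
$id{2}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };

my $grp = 1;
my $col = 'F3';

# Outer map walks the member IDs of the group; the inner map walks the
# array of hash refs stored for each ID and extracts the column value.
my @values = map { map { $_->{$col} } @{$id{$_}}; } @{$grp{$grp}};
print "@values\n";   # subC4 subC5 subC3
```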
my $grp = $_[0];
Why do you make it an array element?

Make what? This line defines a scalar variable named $grp and assigns
the first argument of the sub to it: all arguments are passed in the
array @_, hence $_[0] is the first argument.
the exact item (here, say, C1, in addition to F2) should also be
returned.... (Well, that's why I said the problem is complicated...)

Then, you'll need something similar to Ben's most_frequent routine.
Is this a typo? What's this?

// is similar to || except that it tests for 'definedness' and not
'trueness': It will assign the string to the variable if its value is
undef.
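A tiny example of the difference (the values are made up; // and //=
require perl 5.10 or later):

```perl
use strict;
use warnings;

# //= assigns only when the variable is undef; ||= would also replace
# 'false' but defined values such as 0 or ''.
my $x = undef;
my $y = 0;

$x //= 'inconsistent';
$y //= 'inconsistent';   # $y stays 0: it is defined, merely false

print "$x $y\n";   # inconsistent 0
```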
 

Randal L. Schwartz

ela> but this error "Search pattern not terminated at test.pl"
ela> occurs...

Upgrade your Perl to 5.10 or later.
 

Ben Bacarisse

ela said:
I guess the above line is doing some sort of initialization to zero, but
why is ", " used? Does it make both $most_freq and %count, separated by
the comma, zero?

It's equivalent to

my $most_freq = '';
my %count;
$count{$most_freq} = 0;

but that's not a very good choice. I should have done what one does
with functions like max and min and set $most_freq to the first data
item, as Rainer suggested.
for my $item (@_) {
How does this @_ correspond to @{$column[$col]}[@group] below?

That's how subroutines and argument passing work in Perl. I don't know
what else to say. It's too long since I read a Perl text, so if I try to
explain it I'll probably use the wrong terms, but let me try...
Subroutine arguments are evaluated in list context, so any arrays you
write there get joined together (along with non-array arguments) into
one big argument array. Inside the subroutine, the name for this array
is @_. In this case there is only one argument: the list that results
from slicing the array @{$column[$col]} with the @group array.
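A minimal sketch of this flattening (names and data invented for the
purpose):

```perl
use strict;
use warnings;

# Arrays passed to a sub are flattened into the single argument list @_.
sub count_args { return scalar @_ }

my @column = ([qw(a b c d)]);
my @group  = (0, 2);

# The slice yields two scalars; the sub sees them as two arguments.
print count_args(@{$column[0]}[@group]), "\n";   # 2
```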

Did you catch the bit where I said to ignore what I wrote if the
duplicate IDs were not a typo? I really meant it. I don't think
slicing a column array is the right way to go with the data you have.
You might still find most_frequent a useful function, but that's about
it.

<snip>
 

ela

Dear all,

I've been working on this problem for 4 days and still cannot come up
with a good solution; I would appreciate it if you could comment on the
problem.

Given a table containing cells delimited by tabs, like this (the 1st
line being the header; below I use spaces as the delimiter for
clarity):

ID F1 F2 F3 F4 F5
1 SuperC3 C1 subC4 dummy hi
1 SuperC3 C1 subC3 dumdum hell
2 SuperC3 C1 subC3 hello hello
3 SuperC3 C2 subC7 hel hel
...
1000 SuperC1 C8 subC10 hi hi


and I have another table that groups the IDs together, e.g.
Group 1: 1, 2, 16, 200
Group 2: 99, 136, 555
...
Group 15: 123, 124, 999


The two tables above can contain non-overlapping entries, though most of
the time the entries do overlap. So the task is to make use of the 2nd
table and to look up the first table for evaluation. By checking group
1, we know that IDs 1, 2, 16 and 200 are in the same group. And what to
evaluate? Say, if IDs 16 and 200 also contain subC3 in F3, then we can
assign group 1's members to be subC3. If the consistency does not exceed
a certain threshold (e.g. 0.7), we then look up F2 and check the
statistics again and may conclude C1, and finally F1 (SuperC3). If the
F1 consistency also cannot pass the threshold, "inconsistent" is
returned. The main problem here is: I have been overwhelmed by hash of
hash of hash!

First, I use a hash to bin the groups with the IDs, e.g.
$vhash{$group}{$ID} = 1;
and memorize which $ID belongs to which $group:
$group{$ID} = $groupID;


Then I use another hash to bin the other table, e.g. ($cells[] refers to
the parsing result while reading the table from a file)

$chash{$ID}{F1}{$cells[$F1_location]} = 1;
$chash{$ID}{F2}{$cells[$F2_location]} = 1;
$chash{$ID}{F3}{$cells[$F3_location]} = 1;

And the complicated thing appears here....

for $cid (keys %chash) {
if (exists $group{$cid}) {
for $f1 (keys %{%chash{$cid}{F1} {
for $gid (keys %($vhash{$group{$cid}}) {
.....

Okay, I know it's difficult to read... and that's why I'm seeking advice
here, as the complicated stuff only just starts there... Is there any
convenient data structure that helps better manage this kind of
"cross/circular-referencing"-like problem?
 

ela

Thanks to both Rainer and Ben. The 1st table does contain duplicate IDs
but not duplicate rows, because one ID can contain several F3's, so
(ID, F3) forms a primary key.

Now I have learnt more about how to tackle this problem from you and
shall benefit from reusing these skills in future. Thanks a lot!!
 

ela

Rainer Weikusat said:
Then, you create a hash mapping the column name to the column value
for each ID and put these hashes into an id hash associated with the
ID:

$id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };

Well, the duplicate ID curse appears here... so maybe an array has to be
used instead?

$id{1}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
$id{1}[1] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC5' };
$id{2}[0] = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };
$seen{$_} = 1 for (map { $id{$_}{$col} } @{$grp{$grp}});
return 1.0 / keys(%seen);
This takes a group ID and a column name as argument and returns the
'consistency' of this column for this group.

In fact, I don't quite understand what this code is doing. What does
$grp{$grp} refer to? And as it's no longer $id{$_} but $id{}[], how
should this be adjusted?
my $grp = $_[0];
Why do you make it an array element?
consistency($grp, $_) >= THRESHOLD and return $_ for (@order);
the exact item (here, say, C1, in addition to F2) should also be
returned.... (Well, that's why I said the problem is complicated...)
$res //= 'inconsistent';
Is this a typo? What's this?
 

ela

Rainer Weikusat said:
Provided that all different 'column lists' associated with the same ID
can be treated as if they were associated with different IDs, you
could turn the map expression into something like

map { map { $_->{$col} } @{$id{$_}}; } @{$grp{$grp}}

"map" seems to be doing the "cross-referencing" task that I have to
accomplish. I only learnt the more complicated data structures a few
months ago and would appreciate it if you could explain what is
happening here. So when an element from %grp is extracted, it becomes
the second "$_", bound by the outside map? And then the inner map
extracts another array element from %id, and this element becomes the
first "$_"? Finally, is there a quick way to find the intersection
between the %grp and %id hashes? I want to learn your elegant one-line
map code, but it cannot process the non-overlapping IDs, so I have to
process them the other way.
// is similar to || except that it tests for 'definedness' and not
'trueness': It will assign the string to the variable if its value is
undef.

but this error "Search pattern not terminated at test.pl" occurs...
 

ela

Ben Bacarisse said:
my ($most_freq, %count) = ('', '' => 0);
I guess the above line is doing some sort of initialization to zero, but
why is ", " used? Does it make both $most_freq and %count, separated by
the comma, zero?
for my $item (@_) {
How does this @_ correspond to @{$column[$col]}[@group] below?
$most_freq = $item if ++$count{$item} > $count{$most_freq};
}
return ($most_freq, $count{$most_freq}/@_);
}

This is passed a slice for a given column and group:

most_frequent(@{$column[$col]}[@group])
 

Jürgen Exner

ela said:
I'd appreciate it if I could learn more from you about the thinking
philosophy. As said previously, I only thought of a lot of "if"s and
"hash"es and was never able to use functions to wrap up some of the
concepts. Would you mind telling me which cues trigger you to think
about using a function?

That is actually a good question. And while there are many factors, I
don't think there are any hard rules, although learning how to split a
problem into smaller parts is the most crucial skill in software
engineering.

Some indicators for when to split code into smaller units and/or create
a function:
- abstract data types: when you design an abstract data type, there are
standard operations which you need with any data type, e.g. creating,
initializing, and deleting elements and modifying values. For example,
for any 'list of xyz' you need an empty list, append to a list, access
an element of a list, very likely concatenate two lists, apply a
function to each element of a list, .... So just go ahead and write a
function for each; sooner or later you will need it.
- application domain: often the application domain already provides a
set of functions, e.g. if I were to write a module for statistics I
obviously need functions for mean, average, standard deviation, maybe
min and max, and so forth.
- code reuse: if I find myself typing the same code or very similar code
multiple times, I check whether it can be broken out into a function,
possibly with parameters to account for the minor differences between
the almost-identical pieces of code.
- code complexity: if there are more than 2 to at most 3 levels of
indentation (nested loops and if's), that is usually an indication
that part of the code should be split off into a function.
- code complexity: an odd rule of thumb which I nevertheless still find
a very useful guideline: a VT220 had 24 lines of text. If your function
grew longer than about 2/3 of the screen (i.e. more than ~16 lines), it
was time to consider breaking it up. And if it didn't fit on a single
screen any more, then it was definitely time to check why the function
was so long.
- algorithmic considerations: whenever the algorithm calls for
recursion, you obviously need to create a function.
- code complexity: whenever you are working with higher-order functions,
consider naming the arguments, i.e. create a named function for
wanted() in File::Find or the filter function in grep() and so on. Of
course for simple cases you can use anonymous functions, but whenever it
is a non-trivial function it helps to name it.
- algorithmic consideration: sometimes loops are better written as
recursions. Obviously in such cases you will need a function.
- algorithmic consideration: return values. Whenever I find myself
thinking 'and now all I need is to compute foobar from these 4 values'
then this is a strong indication that I should write a function foobar()
that takes 4 arguments and returns the desired value.

I am sure there are many more indications. But I am also sure that part
of it is personal preference, and most of all it is a matter of
experience. After all, software engineering is still one of the major
fundamentals of computer science, and functions are its smallest
building blocks.

jue
 

Justin C

I managed to count the frequencies with:

$seen{$_}++ for (map { $id{$_}{$col} } @{$grp{$grpid}});

I've not tried it (and, TBH, I have trouble getting my head round it;
map is still new to me) but that doesn't look right to me.

AIUI, the $_ in $seen{$_}++ takes whatever is in $_ before you do the
'map'; the $_ in the map function is, I believe, local to the map
function and therefore not available outside the function.


Justin.
 

Ben Bacarisse

ela said:
I'd appreciate it if I could learn more from you about the thinking
philosophy. As said previously, I only thought of a lot of "if"s and
"hash"es and was never able to use functions to wrap up some of the
concepts. Would you mind telling me which cues trigger you to think
about using a function?

There's no easy answer to that. You must try to get into the habit of
dreaming. You think: what function, if it were available, would make
the job easier? The academic answer is that you think "top down" -- you
imagine the very highest level of the program before you have any idea
how to write it:

table = read_table();
for each group:
    print classify(group, table);

You know that classify will need both the group list and the full table
to do its job so you pass these as parameters. Then you break down
classify:

classify(group, table):
    for each column in table
        top_item = most_common_item_in(column, table);
        freq = freq_of(top_item, column, table);
        if freq > threshold
            return top_item
    return 'inconsistent'

Here you go "ah, classify needs to know the threshold" so you revise the
parameter list.

When you write most_common_item_in and freq_of you will find that they
do very similar things, and you may decide to combine them. That's what
I did.

The trouble with this plan (and why I say this is the academic answer)
is that this breakdown interacts at all stages with the design of the
data structures that you will use. The result is, in practice, a lot
more going back and forth between different ideas. When you find
something is getting too messy to write, you should re-think your data
structures. That takes a lot of experience though. What is too messy?
Maybe this is a messy problem and there is no neat solution?

I had exactly this problem. I told you that you probably don't want an
array of columns because I'd missed the duplicate ID problem. The
trouble was that I did not go back and revise my design. If I had, I'd
have seen that a simple change to the data structure and the way it is
input is all that is needed. If we store arrays of items at each
position, it is simple to extract and "flatten" these before counting
the frequencies. I.e. reading the data becomes

my @column;
while (<>) {
    chomp;
    my @row = split;
    my $c = 0;
    push @{$column[$c++]->[$row[0]]}, $_ foreach @row;
}

The slice @{$column[$col]}[@some_array_of_rows] is now an array of
array references, so we need to flatten it. A function to do that is

sub flatten { map {@$_} @_ }
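For example (the slice data below is made up):

```perl
use strict;
use warnings;

sub flatten { map {@$_} @_ }

# Each slice element is an array reference holding one or more items
# (duplicate IDs); flatten turns the slice into one plain list.
my @slice = ( ['subC4', 'subC3'], ['subC3'] );
my @items = flatten(@slice);
print "@items\n";   # subC4 subC3 subC3
```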

The classify function might then be

sub classify
{
    my ($group, $table, $threshold) = @_;
    for (my $col = 1; $col <= $#$table; $col++) {
        my ($item, $freq) =
            most_frequent(flatten(@{$table->[$col]}[@$group]));
        return $item if $freq >= $threshold;
    }
    return 'inconsistent';
}

I start at 1 because I have retained the (possibly multiple) IDs in the
table. It makes sense to retain them because it is logically possible
to classify by the first column as much as by any other, even though
you don't do this in your case.

Making everything a parameter to classify is obviously the right thing
to do but it does make the key line rather fussy. Still, three short
functions and some input code is all it took in the end.
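For reference, here is a lightly-tested sketch assembling these pieces.
The sample rows are my own, most_frequent uses Rainer's initialization,
and the column loop runs in reverse so that F3 is checked before F2 and
F1, per the original problem statement:

```perl
use strict;
use warnings;

sub most_frequent
{
    my ($most_freq, %count);
    $most_freq = shift;
    $count{$most_freq} = 1;
    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    return ($most_freq, $count{$most_freq} / (@_ + 1));
}

sub flatten { map {@$_} @_ }

sub classify
{
    my ($group, $table, $threshold) = @_;
    for my $col (reverse 1 .. $#$table) {   # F3, then F2, then F1
        my ($item, $freq) =
            most_frequent(flatten(@{$table->[$col]}[@$group]));
        return $item if $freq >= $threshold;
    }
    return 'inconsistent';
}

# Columns indexed by ID: $table[$col][$id] is an array ref of items.
my @table;
my @rows = ( [1, 'SuperC3', 'C1', 'subC4'],
             [1, 'SuperC3', 'C1', 'subC3'],
             [2, 'SuperC3', 'C1', 'subC3'] );
for my $row (@rows) {
    my $c = 0;
    push @{$table[$c++][$row->[0]]}, $_ for @$row;
}

# F3 gives subC3 at only 2/3 < 0.7, so classification falls back to F2.
print classify([1, 2], \@table, 0.7), "\n";   # C1
```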
 

Rainer Weikusat

Justin C said:
I've not tried it (and, TBH, I have trouble getting my head round it;
map is still new to me) but that doesn't look right to me.

AIUI, the $_ in $seen{$_}++ takes whatever is in $_ before you do the
'map'; the $_ in the map function is, I believe, local to the map
function and therefore not available outside the function.

Yes. But the for/foreach loop which processes the 'output' of the map
expression successively binds $_ to each of the values returned by it.
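A small check (sample data invented for the purpose):

```perl
use strict;
use warnings;

my %id = (
    1 => { F3 => 'subC4' },
    2 => { F3 => 'subC3' },
    3 => { F3 => 'subC3' },
);
my %grp = ( 1 => [1, 2, 3] );

my ($grpid, $col) = (1, 'F3');

# The statement-modifier for binds $_ to each value the map returns.
my %seen;
$seen{$_}++ for (map { $id{$_}{$col} } @{$grp{$grpid}});

print "$seen{subC3} $seen{subC4}\n";   # 2 1
```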
 

Rainer Weikusat

Tad McClellan said:
No, it takes on the values returned from the map.

Rewrite it, and it may become more clear:

foreach (map { $id{$_}{$col} } @{$grp{$grpid}}) {
    $seen{$_}++;
}

Provided someone who reads the code is familiar with the semantics of a
loop executing code contained in a block, but not with the semantics of
the equivalent statement modifiers, this someone will understand the
'code with the block' and won't understand the 'code with the statement
modifier'. Neither one nor the other has any meaning someone completely
without knowledge of the defined semantics of each construct can
understand, and both constructs necessarily have clearly defined
semantics because otherwise a computer couldn't execute them. IOW, this
is not a problem of one variant vs the other but of availability or
lack of knowledge on the part of the person trying to understand the
code.
 

Rainer Weikusat

ela said:
Rainer Weikusat said:
ela said:
I've been working on this problem for 4 days and still cannot come up
with a good solution; I would appreciate it if you could comment on the
problem.

Given a table containing cells delimited by tabs, like this

[ please see original for the indeed gory details ]

Provided I understood the problem correctly, a possible solution could
look like this (this code has had very little testing): First, you
define your groups by associating array references containing the group
members with the 'group ID' with the help of a hash:

$grp{1} = [1, 2];

Then, you create a hash mapping the column name to the column value
for each ID and put these hashes into an id hash associated with the
ID:

$id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
$id{2} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };

While I'm revising the code, I find that I over-rely on hashes and that
this complicates my problem. What made you decide to use an array for
"group" but a hash for "id"?

'Groups' didn't have named columns.
 

Rainer Weikusat

Ben Bacarisse said:
ela said:
"Ben Bacarisse" <[email protected]> wrote in message
[...]
I'd appreciate it if I could learn more from you about the thinking
philosophy. As said previously, I only thought of a lot of "if"s and
"hash"es and was never able to use functions to wrap up some of the
concepts. Would you mind telling me which cues trigger you to think
about using a function?

There's no easy answer to that. You must try to get into the habit of
dreaming. You think: what function, if it were available, would make
the job easier? The academic answer is that you think "top down" -- you
imagine the very highest level of the program before you have any idea
how to write it:

table = read_table();
for each group:
    print classify(group, table);

You know that classify will need both the group list and the full table
to do its job so you pass these as parameters. Then you break down
classify:

classify(group, table):
    for each column in table
        top_item = most_common_item_in(column, table);
        freq = freq_of(top_item, column, table);
        if freq > threshold
            return top_item
    return 'inconsistent'

Here you go "ah, classify needs to know the threshold" so you revise the
parameter list.

When you write most_common_item_in and freq_of you will find that they
do very similar things, and you may decide to combine them. That's what
I did.

The trouble with this plan (and why I say this is the academic answer)
is that this breakdown interacts at all stages with the design of the
data structures that you will use.

I assure you that this is not 'the academic answer' but a perfectly
workable design methodology which has essentially been ignored ever
since its invention in the last century, using whatever pretext seemed
most suitable. For as long as the intent is to create working and
easily maintainable code in order to solve problems (as opposed to,
say, "contribute your name to the Linux kernel changelog for the sake
of it being in there"), 'stepwise refinement' is definitely worth
trying instead of just assuming that it cannot possibly work and hence
- thank god! - 'we' can continue with the time-honoured procedure of
'hacking away at it until it's all pieces'.
 

Jim Gibson

ela said:
adding %seen solves the problem though warning "Use of uninitialized value
in numeric comparison (<=>) at test.pl" still exists.

Due to time limitation, I shall focus on working out a draft solution first
and thank you very much for being patient in teaching me netiquette and a
lot of other stuff these years.

You don't want to sort %seen, as that has no meaning. You want to sort
the keys of %seen:

my @sorted_keys = sort { $seen{$b} <=> $seen{$a} } keys %seen;
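For example, with made-up counts:

```perl
use strict;
use warnings;

# Sort the keys of %seen by descending count (hypothetical data).
my %seen = ( subC3 => 3, subC4 => 1, subC7 => 2 );
my @sorted_keys = sort { $seen{$b} <=> $seen{$a} } keys %seen;
print "@sorted_keys\n";   # subC3 subC7 subC4
```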
 
