seeking advice on problem difficulty

  • Thread starter Rainer Weikusat

Ben Bacarisse

Rainer Weikusat said:
Ben Bacarisse said:
ela said:
"Ben Bacarisse" wrote in message
[...]
I'd appreciate if I can learn more from you about the thinking philosophy.
As said previously, I only thought of a lot of "if"'s and "hash"'s and never
able to use function to wrap up some of the concepts. Would you mind telling
me by which cues trigger you to think about using function?

There's no easy answer to that. You must try to get into the habit of
dreaming. You think: what function, if it were available, would make
the job easier? The academic answer is that you think "top down" -- you
imagine the very highest level of the program before you have any idea
how to write it:

table = read_table();
for each group:
    print classify(group, table);

You know that classify will need both the group list and the full table
to do its job so you pass these as parameters. Then you break down
classify:

classify(group, table):
    for each column in table
        top_item = most_common_item_in(column, table);
        freq = freq_of(top_item, column, table);
        if freq > threshold
            return top_item
    return 'inconsistent'

Here you go "ah, classify needs to know the threshold" so you revise the
parameter list.

When you write most_common_item_in and freq_of you will find that they do
very similar things and you may decide to combine them. That's what I
did.
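The outline above can be turned into a small runnable sketch. The toy table layout (one array per column, indexed by row ID), the sample values and this particular most_frequent are assumptions made for illustration; they are not the code posted in the thread.

```perl
use strict;
use warnings;

# One possible merge of most_common_item_in and freq_of: returns the most
# common value in the argument list and its relative frequency (0 to 1).
sub most_frequent {
    my %count;
    my ($best, $best_n) = (undef, 0);
    for my $item (@_) {
        my $n = ++$count{$item};
        ($best, $best_n) = ($item, $n) if $n > $best_n;
    }
    return ($best, @_ ? $best_n / @_ : 0);
}

# Try each column in turn; the first one whose top value is common enough wins.
sub classify {
    my ($group, $table, $threshold) = @_;
    for my $column (@$table) {
        my ($item, $freq) = most_frequent(@{$column}[@$group]);
        return $item if $freq > $threshold;
    }
    return 'inconsistent';
}

# Toy data (assumed): each column is an array indexed by row ID 1..3.
my @table = (
    [ undef, 'C1',    'C1',    'C2'    ],
    [ undef, 'subC4', 'subC3', 'subC3' ],
);
my @groups = ([1, 2], [2, 3], [1, 2, 3]);

print classify($_, \@table, 0.7), "\n" for @groups;
```

The threshold arrives as a third parameter, which is the revision to the parameter list described above.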

The trouble with this plan (and why I say this is the academic answer)
is that this breakdown interacts at all stages with the design of the
data structures that you will use.

I assure you that this is not 'the academic answer' but a perfectly
workable design methodology

I did not say it was unworkable. It's perfectly workable. However, the
description I gave is "academic" in the sense that it is too simple. At
least that is my experience. One rarely finds novel algorithms that
way, so when a problem has an interesting algorithmic core, top-down
design will often miss some interesting solutions. Also it does not
always lead to good data structures first time round. I often find
myself backing up a few levels, re-jigging the data and setting off
again in a slightly different direction.

which has essentially been ignored ever
since its invention in the last century, using whatever pretext seemed
to be most suitable for that.

Has it been ignored? I was taught it and I taught it to others. The
last time I knew about such things (about a decade ago) it was widely
taught in UK universities.

For as long as the intent is to create
working and easily maintainable code in order to solve problems (as
opposed to, say, "contribute your name to the Linux kernel changelog
for the sake of it being in there") 'stepwise refinement' is
definitely worth trying instead of just assuming that it cannot
possibly work and hence - thank god! - 'we' can continue with the
time-honoured procedure of 'hacking away at it until it's all pieces'.

I hope you did not think I was suggesting that it could not possibly
work. I explained my stepwise approach to the problem precisely because
it led to a simple and clean solution. Maybe you took "academic" to
mean "impractical" -- I meant only "simplified for pedagogic reasons".
 

ela

Ben Bacarisse said:
Functions are crucial to managing complexity. I'd want a function
'most_frequent' that can take an array of values and find the frequency
of the most common value among them. It could return both that value
and the frequency. Something like:

I'd appreciate if I can learn more from you about the thinking philosophy.
As said previously, I only thought of a lot of "if"'s and "hash"'s and never
able to use function to wrap up some of the concepts. Would you mind telling
me by which cues trigger you to think about using function?
 

ela

Rainer Weikusat said:
ela said:
I've been working on this problem for 4 days and still cannot come out a
good solution and would appreciate if you could comment on the problem.

Given a table containing cells delimited by tab like this

[ please see original for the indeed gory details ]

Provided I understood the problem correctly, a possible solution could
look like this (this code has had very little testing): First, you
define your groups by associating array references containing the group
members with the 'group ID' with the help of a hash:

$grp{1} = [1, 2];

Then, you create a hash mapping the column name to the column value
for each ID and put these hashes into an id hash associated with the
ID:

$id{1} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' };
$id{2} = { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' };

While I'm revising the codes, I find that just because I overrely on hash
and that complicates my problem. What made you make a decision on using
array for "group" while hash for "id"?
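The thread does not record an answer to this question, so the following is only one reading of the design. A group is an ordered list of member IDs that gets walked from start to end, which suits an array; a record is a set of fields looked up by column name, which suits a hash. The sketch reuses the sample values quoted above.

```perl
use strict;
use warnings;

# Group ID -> ordered list of member IDs (array reference).
my %grp = (1 => [1, 2]);

# Member ID -> record of named fields (hash reference).
my %id = (
    1 => { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC4' },
    2 => { F1 => 'SuperC1', F2 => 'C1', F3 => 'subC3' },
);

# Walk the members in order, then look one field up by name.
my @f3_values = map { $id{$_}{F3} } @{ $grp{1} };
print "group 1, F3: @f3_values\n";   # group 1, F3: subC4 subC3
```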
 

ela

Rainer Weikusat said:
Then, you'll need something similar to Ben's most_frequent routine

I managed to take frequency by:

$seen{$_}++ for (map { $id{$_}{$col} } @{$grp{$grpid}});

but when I wanna sort it by:

my @sorted_keys = sort { $seen{$b} <=> $seen{$a}}

the error "Use of uninitialized value in hash element at test.pl" appears,
so how to sort the values then?
 

ela

Tad McClellan said:
You have left off the list of things to be sorted.

adding %seen solves the problem though warning "Use of uninitialized value
in numeric comparison (<=>) at test.pl" still exists.
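The remaining warning most likely comes from sorting %seen itself. In list context a hash yields its keys and its values interleaved, so the counts are also handed to the comparison block and looked up as if they were keys. A minimal sketch of sorting only the keys, with made-up counts standing in for the real %seen:

```perl
use strict;
use warnings;

# Hypothetical counts, standing in for the %seen built by the counting loop.
my %seen = (C1 => 3, C2 => 1, C3 => 2);

# Sort the keys only, comparing by their counts, highest first.
my @sorted_keys = sort { $seen{$b} <=> $seen{$a} } keys %seen;

print "@sorted_keys\n";   # C1 C3 C2
```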

Due to time limitation, I shall focus on working out a draft solution first
and thank you very much for being patient in teaching me netiquette and a
lot of other stuff these years.
 

Ben Bacarisse

ela said:
Ben Bacarisse said:
my @column;
while (<>) {
    chomp;
    my (@row, $c) = split;
    push @{$column[$c++]->[$row[0]]}, $_ foreach @row;
}

The slice @{$column[$col]}[@some_array_of_rows] is now an array of
array references so we need to flatten it.
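A small sketch of what the loader builds, using three made-up tab-separated rows held in memory rather than read from a file. In the quoted code, `my (@row, $c) = split` lets @row absorb every field, so $c starts out undefined and the first `$c++` yields 0; the sketch spells that out with an explicit `my $c = 0`.

```perl
use strict;
use warnings;

# Hypothetical rows: ID, F2, F1. ID 1 appears twice.
my @lines = ("1\tC2\tSubC1", "1\tC2\tSubC3", "2\tC2\tSubC3");

my @column;
for my $line (@lines) {
    my @row = split /\t/, $line;
    my $c = 0;
    # $column[column number][row ID] is a reference to an array of values.
    push @{ $column[$c++]->[$row[0]] }, $_ foreach @row;
}

# Slice column 2 for row IDs 1 and 2: two array references come back.
my @slice = @{ $column[2] }[1, 2];

# Dereference each one to get the plain values.
my @flat = map { @$_ } @slice;

print scalar(@slice), " refs, ", scalar(@flat), " values: @flat\n";
# 2 refs, 3 values: SubC1 SubC3 SubC3
```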

This is a big part of the answer to this question:

Meanwhile, may I first ask what do you mean by "flatten" (@_ and {@$_} drive
me crazy...)?

The columns are now made up of array references where before they
contained simple scalars. That's true also for any slice from a
column. In order for most_frequent to do its job it must be passed all
the elements in all of these referenced arrays. This conversion is
often described as flattening. The operation that does it

map { @$_ } @some_array_of_array_references

relies on a slightly odd property of Perl. @$_ is the array that $_ is
a reference to, and Perl joins multiple arrays together in an array
context like this. In almost all other languages I can think of some
sort of list or array append operation would have to be explicit in such
a flattening function, but in Perl it's implicit.

It seems that a scalar and a hash should be passed to function
most_frequent

sub most_frequent
{
my ($most_freq, %count) = ('', '' => 0);

These are just local variables and tell you nothing about what should be
passed. The function goes on to process the argument array @_ treating
each element as a hash index for counting frequencies.

You'll know you've got the hang of it when you can explain the effects
of these three calls:

most_frequent((1,2,3),(2,3,4))
most_frequent([1,2,3],[2,3,4])
most_frequent(flatten([1,2,3],[2,3,4]))
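For readers who want to check their explanation, the three calls can be run as follows. The body of most_frequent is the one quoted elsewhere in this thread with the debugging prints removed; the printf line is only an illustrative way of showing the results.

```perl
use strict;
use warnings;

sub flatten { map { @$_ } @_ }

sub most_frequent {
    my ($most_freq, %count) = ('', '' => 0);
    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    return ($most_freq, $count{$most_freq} / @_);
}

# The inner parentheses vanish: six plain scalars arrive in @_.
my @r1 = most_frequent((1, 2, 3), (2, 3, 4));

# Two array references arrive; each is counted once, under a key that is
# its string form, something like "ARRAY(0x...)".
my @r2 = most_frequent([1, 2, 3], [2, 3, 4]);

# flatten dereferences each reference, so this behaves like the first call.
my @r3 = most_frequent(flatten([1, 2, 3], [2, 3, 4]));

printf "%s %.2f\n", @$_ for \@r1, \@r2, \@r3;
```

The first and third calls report the value 2 with frequency 2/6; the second reports the first reference with frequency 1/2.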
 

ela

Ben Bacarisse said:
my @column;
while (<>) {
    chomp;
    my (@row, $c) = split;
    push @{$column[$c++]->[$row[0]]}, $_ foreach @row;
}

The slice @{$column[$col]}[@some_array_of_rows] is now an array of
array references so we need to flatten it. A function to do that is

sub flatten { map {@$_} @_ }

The classify function might then be

sub classify
{
    my ($group, $table, $threshold) = @_;
    # Use the table that was passed in, not the global @column.
    for (my $col = 1; $col < $#$table; $col++) {
        my ($item, $freq) =
            most_frequent(flatten(@{@$table[$col]}[@$group]));
        return $item if $freq >= $threshold;
    }
    return 'inconsistent';
}

It seems that I have to spend at least 2 weeks to upgrade myself in using
advanced perl data structures by reading this:
http://docstore.mik.ua/orelly/perl3/prog/ch09_01.htm

Meanwhile, may I first ask what do you mean by "flatten" (@_ and {@$_} drive
me crazy...)? It seems that a scalar and a hash should be passed to function
most_frequent

sub most_frequent
{
my ($most_freq, %count) = ('', '' => 0);
 

ela

Tad McClellan said:
ela said:
It seems that I have to spend at least 2 weeks to upgrade myself in using
advanced perl data structures by reading this:

[snip url]


Please do not post links to pirated copies of copyrighted material.

Please attempt to be an honorable person instead.

Instead of realizing the source to be pirated, I was too heavily relied on
"Google". I shall pay attention to that then.

my ($most_freq, %count) = ('', '' => 0);
                          ^^^^^^^^^^^^^

There is no hash on the RHS, only a list of 3 scalars.

the 1st scalar is '', and the 2nd is the key '' and the 3rd for value '0'?
 

Ben Bacarisse

ela said:
I still don't quite get "flattening" means and therefore post the
codes.

Another phrase that might help: it means turning a nested structure into
one that is not nested anymore. An example being turning an array of
array references into a single array.

While I just understood yesterday that @$group is good in array structures
but sparse indices, I don't know how to pass the corresponding data
structures to subroutine classify, as Tad McClellan pointed out passing
anything to subroutine makes a long list compressing everything together, so
where do the subroutines separate the items correspondingly?

I think you are trying to learn Perl without a good source. The
documentation is excellent, but you have to read a lot of text to get
the big picture so I prefer a good book.

Anyway, to pass separate arrays to a function you must use references.
classify should be called like this:

classify(\@group, \@column, 0.7)

It then gets one single array argument (called @_) with three elements:
two array references and a number. Maybe I should have used more
helpful names. The arguments to classify end up in variables called:

sub classify
{
    my ($group, $table, $threshold) = @_;
    ...
}

The first two are references which is why the arrays are referred to
using the @$group, @$table syntax. $group_ref and $tab_ref might have
been better. I don't know.
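The difference between the two calling styles can be seen by counting what arrives in @_. The helper names count_args and show are made up for this sketch, as are the sample arrays.

```perl
use strict;
use warnings;

# Reports how many separate arguments the caller's list turned into.
sub count_args { return scalar @_ }

my @group  = (1, 2);
my @column = ('a', 'b', 'c');

# Arrays are expanded in place: 2 + 3 + 1 = 6 scalars, boundaries lost.
my $by_value = count_args(@group, @column, 0.7);

# References keep each array intact: exactly 3 arguments.
my $by_ref = count_args(\@group, \@column, 0.7);

sub show {
    my ($group, $table, $threshold) = @_;
    # @$group and @$table dereference back to the caller's arrays.
    return scalar(@$group) . " ids, " . scalar(@$table)
         . " columns, threshold $threshold";
}

print "$by_value vs $by_ref\n";             # 6 vs 3
print show(\@group, \@column, 0.7), "\n";   # 2 ids, 3 columns, threshold 0.7
```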

open( ASM, "<asm.file" );
while (<ASM>) {
if (/\{CTG/) { $getCtgID = 1; }
elsif (/iid:(\d+)/){if($getCtgID){$group=$1;$getCtgID=0; }}
elsif (/src:(\d+)/) {
if ($1 % 2 != 1) { $ID = $1 - 1;
} else { $ID = $1; } # print $1; <STDIN>;
push @$group, $ID;
}
}
}

This can't be what you tried since it has too many }s. It's important to
post actual code. I can't unravel what you are doing with $group. You
use it to store a number in one place and later use it as an array
reference. Note that if it is to be an array reference, then you call
classify like this:

classify($group, \@column, 0.7)

sub flatten { map {@$_} @_ }

sub most_frequent
{
    #print "MOST FREq", $most_freq; <STDIN>;
    my ($most_freq, %count) = ('', '' => 0);
    print "MOST FREq", $most_freq; <STDIN>;
    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    return ($most_freq, $count{$most_freq}/@_);
}


sub classify
{
my ($group, $table, $threshold) = @_;
for (my $col = 9; $col > 6; $col--) {

Even though $table is an array reference, you can still use Perl's #
syntax to get the index of the last column:

for (my $col = $#$table - 1; $col > 6; $col--)

If this is part of a more complex program, I'd consider using the table
headings to populate a hash so I could map column names to indexes.
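A minimal sketch of that idea, assuming the heading line is tab-separated. The heading names here are made up for illustration; a real program would read the first line of the table file instead.

```perl
use strict;
use warnings;

# Hypothetical heading line standing in for the first line of the table.
my $header = join "\t", qw(ID Country City Street Weight);

my @fields = split /\t/, $header;

# Hash slice: every heading name maps to its column position.
my %col_index;
@col_index{@fields} = 0 .. $#fields;

print "Street is column $col_index{Street}\n";   # Street is column 3

# Columns to check, written by name rather than by number.
my @catchk = @col_index{qw(Street City Country)};
print "@catchk\n";                               # 3 2 1
```

Looking columns up by name keeps the loop bounds readable and survives a reordering of the columns in the input file.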

classify(@group, @column, 0.7);

I think you need

classify($group, \@column, 0.7);

here (and you need to do something with its result) but if you correct
the earlier code to read the group into a plain array then you need

classify(\@group, \@column, 0.7);
 

ela

I still don't quite get "flattening" means and therefore post the codes.
While I just understood yesterday that @$group is good in array structures
but sparse indices, I don't know how to pass the corresponding data
structures to subroutine classify, as Tad McClellan pointed out passing
anything to subroutine makes a long list compressing everything together, so
where do the subroutines separate the items correspondingly?


#!/usr/bin/perl
use warnings;
use Switch;
use Data::Dumper;
use Statistics::Descriptive;

my $nl = "\n";
my ($conffile, $asmfile) = @ARGV;

my $outfile = $conffile . "_by_asm.xls";
open OUTFP, '>', $outfile or die "Can't open $outfile for writing: $!\n";
open my $r2fp, '<', $conffile or die "Can't open $conffile for reading: $!\n";

# $conffile is a file with tab-delimited data like:
# ID F9 F8 F7 F6 F5 F4 F3 F2 F1 Weight
# 1 hello dummy hi a bb oh SuperC1 C2 SubC1 0.5
# 1 hello dummy hi a bb oh SuperC1 C2 SubC3 0.5
# 2 ha dummy hi a bb oh SuperC1 C2 SubC3 1
# 3 ha dummy hi a bb oh SuperC1 C2 SubC4 1
# ....



my @column;
<$r2fp>;
while (<$r2fp>) {
    chomp;
    my (@row, $c) = split;
    push @{$column[$c++]->[$row[0]]}, $_ foreach @row;
}

open( ASM, "<asm.file" );
while (<ASM>) {
if (/\{CTG/) { $getCtgID = 1; }
elsif (/iid:(\d+)/){if($getCtgID){$group=$1;$getCtgID=0; }}
elsif (/src:(\d+)/) {
if ($1 % 2 != 1) { $ID = $1 - 1;
} else { $ID = $1; } # print $1; <STDIN>;
push @$group, $ID;
}
}
}

sub flatten { map {@$_} @_ }

sub most_frequent
{
    #print "MOST FREq", $most_freq; <STDIN>;
    my ($most_freq, %count) = ('', '' => 0);
    print "MOST FREq", $most_freq; <STDIN>;
    for my $item (@_) {
        $most_freq = $item if ++$count{$item} > $count{$most_freq};
    }
    return ($most_freq, $count{$most_freq}/@_);
}


sub classify
{
    my ($group, $table, $threshold) = @_;
    for (my $col = 9; $col > 6; $col--) {
        my ($item, $freq) =
            most_frequent(flatten(@{@$table[$col]}[@$group]));
        return $item if $freq >= $threshold;
    }
    return 'inconsistent';
}

# most_frequent((1,2,3),(2,3,4));
# most_frequent([1,2,3],[2,3,4]);
# my ($item, $freq) = most_frequent(flatten([3,3,3],[2,3,3]));
# print "ITEM", $item, "FREQ", $freq;

classify(@group, @column, 0.7);
 

Jürgen Exner

ela said:
I still don't quite get "flattening" means and therefore post the codes.
While I just understood yesterday that @$group is good in array structures
but sparse indices, I don't know how to pass the corresponding data
structures to subroutine classify, as Tad McClellan pointed out passing
anything to subroutine makes a long list compressing everything together,

Perfect! So you do understand what flattening means because that is it.

Any list that is embedded in the argument list, is simply expanded until
the argument list contains only scalars. Example:
(1, (2, 3, 4), ((a, b), 5)) # the list contains only 3 elements!
is flattened into
(1, 2, 3, 4, a, b, 5) # now it contains 7 elements
so
where do the subroutines separate the items correspondingly?

They don't. Period.
If you want to pass complex data structures to a subroutine then you
have to pass references to those data structures.
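The element counts in the example can be checked directly. In this sketch the barewords a and b are quoted so that it runs under strict, and the second array shows the same shape written with array references, which is what keeps the nesting intact.

```perl
use strict;
use warnings;

# Parentheses only group; they do not create nested lists.
my @list = (1, (2, 3, 4), (('a', 'b'), 5));
print scalar(@list), "\n";     # 7

# Square brackets create array references, so the structure survives.
my @nested = (1, [2, 3, 4], [['a', 'b'], 5]);
print scalar(@nested), "\n";   # 3
```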

jue
 

Ben Bacarisse

ela said:
Yes, one more "}" (pointed by <--------------) was included. I removed other
irrelevant condition checking which does not affect the "group"
construction. In fact, I have to construct group by parsing a file that
contains keywords like "CTG" followed by "iid", which indicate the group
numbers, e.g. 1,2,3,4, ... 19

and then for each group, the member ID's are labeled with the keyword "src",
so I used the code

push @$group, $ID

to assign the members to the group they belong to.

Ah, that code does not do what you want then. Just use a plain array
for the groups. Each element of this array will itself be an array
(actually an array reference, of course).

When you see a group number, set $group_id = $1. When you see a number
that is a member of the current group, you do

push @groups[$group_id], $1;

(there's no need to set $ID and then push $ID).

Even though $table is an array reference, you can still use Perl's #
syntax to get the index of the last column:

for (my $col = $#$table - 1; $col > 6; $col--)

If this is part of a more complex program, I'd consider using the table
headings to populate a hash so I could map column names to indexes.

Yes, you are brilliant to know that I have done that. Indeed I do something
like:

open my $r2fp, '<', $conffile or die "Can't open $conffile for reading: $!\n";
my $line = <$r2fp>; chomp $line;
my @fields = split(/\t/, $line);
for (my $i=0; $i<@fields; $i++) { # print $fields[$i]; <STDIN>;
############## check fields ######
if ( $fields[$i] =~ /^Identity of query$/i) {
$qryii = $i; # print $qryii; <STDIN>;
} elsif ( $fields[$i] =~ /^Country$/i) { $fii = $i;
} elsif ( $fields[$i] =~ /^City$/i) { $gii = $i;
} elsif ( $fields[$i] =~ /^Street$/i){ $sii = $i;
}
}

Actually that's not what I meant though it is similar. I think I'd find
those $fii, $gii and so on rather confusing but if it works for you...

while (<$r2fp>) {
    chomp;
    my (@row, $c) = split;
    push @{$column[$c++]->[$row[0]]}, $_ foreach @row;
}

<snip>

@catchk = ($sii, $gii, $fii);
for $col (@catchk) {
my ($item, $freq) =
most_frequent(flatten(@{@$table[$col]}[@$group]));
return $item if $freq >= $threshold;
}

but now I start to realize that
"most_frequent(flatten(@{@$table[$col]}[@$group]));" may not be doing the
job I mean, likely due to my unclear problem specification.

I have 2 files to parse:
one file contains the group membership information and after parsing that
file, I should have something like:

grp 1: [1,19,22,387]
grp 2: [9,18,101]
grp 3: [2, 1119]
...

and another file that contains a large table that contains no group
membership information but only the ID's like:

1 hello blahblah ... UK London Downing 0.5
1 hello blahblah ... US New York Wall 0.5
2 hi ...

so the first two rows, having the ID "1", should belong to grp 1, the third
row, should belong to grp 3.

You are right, I don't understand the problem anymore, but that does not
matter as long as you know what you need.

by using {@$table[$col]}[@$group], it appears to mean that by knowing the ID
can immediately also know the group number...

I don't see how it can mean that. It might be that I've been addressing
the wrong problem, but maybe now you have some more tools you can use to
solve the problem you actually have.
 

Ben Bacarisse

Jürgen Exner said:
Perfect! So you do understand what flattening means because that is it.

Any list that is embedded in the argument list, is simply expanded until
the argument list contains only scalars. Example:
(1, (2, 3, 4), ((a, b), 5)) # the list contains only 3 elements!
is flattened into
(1, 2, 3, 4, a, b, 5) # now it contains 7 elements

I should point out that I was using the term in the very general sense
of flattening data structures, because I needed to turn an array of
array references into a single list argument and, of course, Perl does
not do that automatically. The (for a beginner) slightly baffling code

map { @$_ } @_

uses Perl's list flattening to do this more general flattening.
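To make the contrast concrete, here is the same flattening written twice: once with the append spelled out, and once relying on map to join the per-element lists. The sample data is made up.

```perl
use strict;
use warnings;

my @refs = ([1, 2, 3], [2, 3, 4]);

# Explicit version: append each referenced array in turn.
my @explicit;
push @explicit, @$_ for @refs;

# Implicit version: the block returns one list per element and map
# delivers them as a single list.
my @implicit = map { @$_ } @refs;

print "@explicit\n";   # 1 2 3 2 3 4
print "@implicit\n";   # 1 2 3 2 3 4
```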

Tad McClellan pointed out that I did not make the proper distinction
between lists and arrays, and he's quite right about that, but it was
deliberate. I was trying to avoid getting into it though I accept that
that might have turned out to be an unhelpful thing to have done.

<snip>
 

Ben Bacarisse

ela said:
Ben Bacarisse said:
When you see a group number, set $group_id = $1. When you see a number
that is a member of the current group, you do

push @groups[$group_id], $1;

After changing as above, the following errors appear...

Scalar value @group[$grpID] better written as $group[$grpID] at test3.pl
line 50.
Type of arg 1 to push must be array (not array slice) at test3.pl line 50,
near "$1;"
Execution of test3.pl aborted due to compilation errors.

Yes, my mistake, but it's a shame you could not see the problem and
correct it yourself:

push @{$groups[$group_id]}, $1;
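Put together, the group-reading loop might look like the sketch below. The input lines are invented; only the {CTG, iid: and src: markers come from the code posted in the thread, and the odd/even adjustment of the src number in that code is left out here.

```perl
use strict;
use warnings;

# Hypothetical input, held in memory instead of read from asm.file.
my @lines = ('{CTG', 'iid:1', 'src:1', 'src:19', '}',
             '{CTG', 'iid:3', 'src:2', 'src:1119', '}');

my (@groups, $group_id, $expect_id);
for my $line (@lines) {
    if ($line =~ /\{CTG/) {
        $expect_id = 1;                  # the next iid: names the group
    }
    elsif ($line =~ /iid:(\d+)/) {
        ($group_id, $expect_id) = ($1, 0) if $expect_id;
    }
    elsif ($line =~ /src:(\d+)/ and defined $group_id) {
        # $groups[$group_id] springs into existence as an array reference.
        push @{ $groups[$group_id] }, $1;
    }
}

for my $id (grep { defined $groups[$_] } 0 .. $#groups) {
    print "grp $id: [", join(',', @{ $groups[$id] }), "]\n";
}
# grp 1: [1,19]
# grp 3: [2,1119]
```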
 

ela

Ben Bacarisse said:
Another phrase that might help: it means turning a nested structure into
one that is not nested anymore. An example being turning an array of
array references into a single array.

Thanks a lot. I finally get the meaning of "flattening".

This can't be what you tried since it has too many }s. It's important to
post actual code. I can't unravel what you are doing with $group. You
use it to store a number in one place and later use it as an array
reference.

Yes, one more "}" (pointed by <--------------) was included. I removed other
irrelevant condition checking which does not affect the "group"
construction. In fact, I have to construct group by parsing a file that
contains keywords like "CTG" followed by "iid", which indicate the group
numbers, e.g. 1,2,3,4, ... 19

and then for each group, the member ID's are labeled with the keyword "src",
so I used the code

push @$group, $ID

to assign the members to the group they belong to.

Even though $table is an array reference, you can still use Perl's #
syntax to get the index of the last column:

for (my $col = $#$table - 1; $col > 6; $col--)

If this is part of a more complex program, I'd consider using the table
headings to populate a hash so I could map column names to indexes.

Yes, you are brilliant to know that I have done that. Indeed I do something
like:

open my $r2fp, '<', $conffile or die "Can't open $conffile for reading: $!\n";
my $line = <$r2fp>; chomp $line;
my @fields = split(/\t/, $line);
for (my $i=0; $i<@fields; $i++) { # print $fields[$i]; <STDIN>;
############## check fields ######
if ( $fields[$i] =~ /^Identity of query$/i) {
$qryii = $i; # print $qryii; <STDIN>;
} elsif ( $fields[$i] =~ /^Country$/i) { $fii = $i;
} elsif ( $fields[$i] =~ /^City$/i) { $gii = $i;
} elsif ( $fields[$i] =~ /^Street$/i){ $sii = $i;
}
}
while (<$r2fp>) {
    chomp;
    my (@row, $c) = split;
    push @{$column[$c++]->[$row[0]]}, $_ foreach @row;
}

<snip>

@catchk = ($sii, $gii, $fii);
for $col (@catchk) {
my ($item, $freq) =
most_frequent(flatten(@{@$table[$col]}[@$group]));
return $item if $freq >= $threshold;
}

but now I start to realize that
"most_frequent(flatten(@{@$table[$col]}[@$group]));" may not be doing the
job I mean, likely due to my unclear problem specification.

I have 2 files to parse:
one file contains the group membership information and after parsing that
file, I should have something like:

grp 1: [1,19,22,387]
grp 2: [9,18,101]
grp 3: [2, 1119]
....

and another file that contains a large table that contains no group
membership information but only the ID's like:

1 hello blahblah ... UK London Downing 0.5
1 hello blahblah ... US New York Wall 0.5
2 hi ...

so the first two rows, having the ID "1", should belong to grp 1, the third
row, should belong to grp 3.

by using {@$table[$col]}[@$group], it appears to mean that by knowing the ID
can immediately also know the group number...
 

ela

Ben Bacarisse said:
When you see a group number, set $group_id = $1. When you see a number
that is a member of the current group, you do

push @groups[$group_id], $1;

After changing as above, the following errors appear...

Scalar value @group[$grpID] better written as $group[$grpID] at test3.pl
line 50.
Type of arg 1 to push must be array (not array slice) at test3.pl line 50,
near "$1;"
Execution of test3.pl aborted due to compilation errors.
 
