complex problem


Willem

ela wrote:
) I have a hundred files with predefined columns but an unknown number of
) rows:
<snip>
) While using a regular expression to extract the information is easy, and
) building the table with an associative array is also simple, the main problem
) is padding the previous columns that have no records.

Easy. Do two passes over the data.
The first pass finds the full set of columns; the second pass does all the padding.
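
Sketched in Perl, the two passes might look like this. The "ID value" whitespace-delimited format and the helper name are assumptions for illustration; the real parsing is up to you.

```perl
use strict;
use warnings;

# Two-pass sketch (hypothetical "ID value" whitespace-delimited files):
# pass 1 collects every row ID seen anywhere, pass 2 fills and pads.
sub build_padded_table {
    my @files = @_;

    # Pass 1: find the full set of row IDs across all files.
    my %seen;
    for my $file (@files) {
        open my $fh, '<', $file or die "can't open $file: $!";
        while (my $line = <$fh>) {
            my ($id) = split ' ', $line;
            $seen{$id} = 1 if defined $id;
        }
    }

    # Pass 2: re-read each file into its own column.
    my %cell;    # $cell{$id}[$col] = value
    my $col = 0;
    for my $file (@files) {
        open my $fh, '<', $file or die "can't open $file: $!";
        while (my $line = <$fh>) {
            my ($id, $value) = split ' ', $line;
            $cell{$id}[$col] = $value if defined $id;
        }
        $col++;
    }

    # Emit one row per ID, padding cells no file supplied.
    my @rows;
    for my $id (sort keys %seen) {
        my @row = map { defined $_ ? $_ : 'n/a' }
                  @{ $cell{$id} }[0 .. $#files];
        push @rows, join("\t", $id, @row);
    }
    return @rows;
}
```

Two passes mean reading each file twice, which is no burden at a hundred files; note that here the "columns" are the files themselves, so pass 1 really collects the row IDs that need padding.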


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

ela

I have a hundred files with predefined columns but an unknown number of
rows:

For example, a row from file1 looks like this (delimited by \t):

chr19 56837617 56841944
SIGLEC14(NM_001098612,expression:8.88665993183852) 0 - 56837617
56841944 255,12,12 1 4327 0

What I have to do is build a big table containing this information:

Name\tExpression_file1\tExpression_file2\t ...
Expression_file99\tExpression_file100\n
NM_001098612\t8.88665993183852\t ...\n
....

While using a regular expression to extract the information is easy, and
building the table with an associative array is also simple, the main problem
is padding the previous columns that have no records. While the current
column can be safely padded, the previous columns require more and more
lookups. Recursion seemed to be the solution after my attempts with
for-loops, but the recursion routine turned out to be more difficult than I
previously thought...
 

Xho Jingleheimerschmidt

ela said:
I have a hundred files with predefined columns but an unknown number of
rows:

for example, a row from file1 is like this (delimited by \t)

chr19 56837617 56841944
SIGLEC14(NM_001098612,expression:8.88665993183852) 0 - 56837617
56841944 255,12,12 1 4327 0

Word wrap makes this rather difficult to read.
What I have to do is build a big table containing this information:

Name\tExpression_file1\tExpression_file2\t ...
Expression_file99\tExpression_file100\n
NM_001098612\t8.88665993183852\t ...\n
....

While using a regular expression to extract the information is easy, and
building the table with an associative array is also simple, the main problem
is padding the previous columns that have no records.

Why is that a problem? If all the files are passed in on @ARGV, and
each file of input is turned into a new column in the output, then you
already know what all the columns in the output are going to be, right
up front.
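
A sketch of that one-pass approach (the whitespace-delimited "ID value" format and the sub name are assumptions for illustration): the file list given up front is the column list, so no back-filling of earlier columns is ever needed.

```perl
use strict;
use warnings;

# One pass suffices: the file list IS the column list, so every output
# column is known before any data is read.
# Assumes hypothetical "ID value" whitespace-delimited lines.
sub build_table {
    my @columns = @_;    # filenames, in output column order
    my %value;           # $value{$id}{$file} = value

    for my $file (@columns) {
        open my $fh, '<', $file or die "can't open $file: $!";
        while (my $line = <$fh>) {
            my ($id, $val) = split ' ', $line;
            $value{$id}{$file} = $val if defined $id;
        }
    }

    # Pad every row against the known column list.
    my @rows;
    for my $id (sort keys %value) {
        push @rows, join "\t", $id,
            map { defined $value{$id}{$_} ? $value{$id}{$_} : 'n/a' } @columns;
    }
    return @rows;
}
```

Called as `build_table(@ARGV)`, this emits one padded row per ID with `n/a` wherever a file had no record for that ID.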

While the current
column can be safely padded, the previous columns require more and more
lookups. Recursion seems to be the solution after my attempts with
for-loops, but the recursion routine appears more difficult than I
previously thought...

I don't understand how recursion could plausibly be useful here.

Anyway, what I often find myself doing is using two hashes.

my %exp;
my %sample;
while (<>) {
    my ($refseq, $expression, $sample) = parse_however($_);
    $exp{$refseq}{$sample} = $expression;
    $sample{$sample} = ();    # value unused; the key records the sample
}

Now %sample contains an entry for every sample/tissue/file which has at
least one second-level entry in %exp.

Of course you could have reversed the nesting order of the keys,
$exp{$sample}{$refseq}, but I assume that would be inconvenient for
other reasons, or you would have done it already.
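
For illustration, here is a runnable version of the two-hash idea. The `parse_however` below is a stand-in that assumes whitespace-delimited "refseq expression sample" lines, which is not ela's real format:

```perl
use strict;
use warnings;

# Stand-in parser: assumes "refseq expression sample", whitespace-delimited.
sub parse_however {
    return split ' ', $_[0];
}

# Build %exp and %sample from input lines, then emit one padded row per refseq.
sub pad_rows {
    my @lines = @_;
    my (%exp, %sample);
    for my $line (@lines) {
        my ($refseq, $expression, $sample) = parse_however($line);
        $exp{$refseq}{$sample} = $expression;
        $sample{$sample} = ();    # the key is the point; the value is unused
    }

    # %sample now names every column, so every row can be padded to full width.
    my @out;
    for my $refseq (sort keys %exp) {
        push @out, join "\t", $refseq,
            map { defined $exp{$refseq}{$_} ? $exp{$refseq}{$_} : 'n/a' }
            sort keys %sample;
    }
    return @out;
}
```

The point is that `exists $sample{$file}` is true even though the stored value is undef, so `sort keys %sample` enumerates every column that must appear in the output.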


Xho
 

Xho Jingleheimerschmidt

ela said:
Sorry, but would you mind elaborating on why the second hash serves the
purpose? I can't quite follow it...

Sorry, but it seems self-evident to me, so I don't see how I can explain
it. Maybe I'm not correctly apprehending what the purpose is that you
have in mind.

Xho
 

ela

Anyway, what I often find myself doing is using two hashes.
my %exp;
my %sample;
while (<>) {
    my ($refseq, $expression, $sample) = parse_however($_);
    $exp{$refseq}{$sample} = $expression;
    $sample{$sample} = ();
}

Now %sample contains an entry for every sample/tissue/file which has at
least one second-level entry in %exp.

Sorry, but would you mind elaborating on why the second hash serves the
purpose? I can't quite follow it...
 

Xho Jingleheimerschmidt

ela said:
Let me give a simple example:

File 1
ID character
1 A
2 T
3 G

File 2
ID character
1 A
3 T

File 3
ID character
2 A
3 T
4 C

processed result

ID    File1_character    File2_character    File3_character
1     A                  A                  n/a
2     T                  n/a                A
3     G                  T                  T
4     n/a                n/a                C

OK, so you have two hashes, one of them multi-level.

The multilevel one is $base{$id}{$file}=$nucleotide, so it contains all
of the data.

The other one is just $file{$file}=(), so it tells you every file, i.e.
every column that needs to exist in the output, so that you can reserve
space for them all, even if a given $id doesn't have data for a given $file.

Once you are done reading all the files, you'd probably want something like:

my @file=sort keys %file; # might want a non-default sort method.

to put that data into a more convenient format for using.

Then:
foreach my $id (keys %base) {
    my @output = @{ $base{$id} }{@file};    # hash slice: one cell per file
    defined $_ or $_ = 'n/a' foreach @output;
    print join("\t", $id, @output), "\n";
}


I've changed the names of the hashes from the previous post, because you
changed the nature of the data contained in them from your previous example.
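
Assembled into one runnable piece (using the File 1/2/3 example above, with in-memory strings standing in for real files and hypothetical column names), the whole recipe looks like:

```perl
use strict;
use warnings;

# The three example files as in-memory strings (stand-ins for real files).
my @inputs = ("1 A\n2 T\n3 G\n", "1 A\n3 T\n", "2 A\n3 T\n4 C\n");

my (%base, %file);
my $n = 0;
for my $content (@inputs) {
    my $name = 'File' . ++$n;       # hypothetical column name
    open my $fh, '<', \$content or die "can't open in-memory file: $!";
    while (my $line = <$fh>) {
        my ($id, $nt) = split ' ', $line;
        $base{$id}{$name} = $nt;
        $file{$name} = ();          # record the column
    }
}

my @file = sort keys %file;         # column order for the output
my @rows;
for my $id (sort keys %base) {
    my @output = @{ $base{$id} }{@file};    # hash slice, one cell per column
    defined $_ or $_ = 'n/a' foreach @output;
    push @rows, join("\t", $id, @output);
}
print "$_\n" for @rows;
```

Running this prints the padded table from the example, with n/a wherever a file lacked a record for that ID.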

Xho
 

Xho Jingleheimerschmidt

Tad said:
Please put the subject of your article in the Subject of your article.

If he attempted to do that, he would be accused of committing the XYZ
problem.

Xho
 

ela

Sorry, but it seems self-evident to me, so I don't see how I can explain
it. Maybe I'm not correctly apprehending what the purpose is that you
have in mind.

Xho

Let me give a simple example:

File 1
ID character
1 A
2 T
3 G

File 2
ID character
1 A
3 T

File 3
ID character
2 A
3 T
4 C

processed result

ID    File1_character    File2_character    File3_character
1     A                  A                  n/a
2     T                  n/a                A
3     G                  T                  T
4     n/a                n/a                C
 

sln

Let me give a simple example:

File 1
ID character
1 A
2 T
3 G

File 2
ID character
1 A
3 T

File 3
ID character
2 A
3 T
4 C

processed result

ID    File1_character    File2_character    File3_character
1     A                  A                  n/a
2     T                  n/a                A
3     G                  T                  T
4     n/a                n/a                C

This can be done in more than one way:

$data[ $id ][ file number ]  -- id as array index; depends on id being a small integer
$data{ $id }[ file number ]  -- id as hash key; the keys are not sorted

-sln
--------------
use strict;
use warnings;

my $f1 =<<EO1;
1 A
2 T
3 G
EO1
my $f2 =<<EO2;
1 A
3 T
EO2
my $f3 =<<EO3;
2 A
3 T
4 C
EO3

my %data;
my $filecount = 0;

# Put file list in the column order
# of the output wanted
# --------------------------
for my $file (\$f1, \$f2, \$f3)
{
    open my $fh, '<', $file or die "can't open file: $!";

    while ( defined (my $line = <$fh>) )
    {
        my ($id, $char) = parseline($line);
        next unless defined $id;
        $data{ $id }[ $filecount + 1 ] = $char;
        unless (defined $data{ $id }[ 0 ]) {
            $data{ $id }[ 0 ] = $id;
        }
    }
    close $fh;
    ++$filecount;
}

for my $id (sort keys %data)
{
    for my $count (0 .. $filecount) {
        $data{ $id }[ $count ] = 'n/a'
            unless defined ($data{ $id }[ $count ]);
    }
    print " '@{$data{ $id }}'\n";
}

sub parseline
{
    return $_[0] =~ /\s*(\S+)\s+(\S+)\s*/;
}
 
