complex problem


Willem

ela wrote:
) I have a hundred files with predefined columns but an unknown number of
) rows:
<snip>
) While using a regular expression to extract the information is easy, and
) building the table with an associative array is also simple, the main problem
) is padding the previous columns that have no records.

Easy. Do two passes over the data.
The first pass finds the full set of columns; the second pass does all the padding.
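
Sketched in Perl, the two passes might look like this. The "ID value" whitespace-delimited format and the helper name are assumptions for illustration; the real parsing is up to you.

```perl
use strict;
use warnings;

# Two-pass sketch (hypothetical "ID value" whitespace-delimited files):
# pass 1 collects every row ID seen anywhere, pass 2 fills and pads.
sub build_padded_table {
    my @files = @_;

    # Pass 1: find the full set of row IDs across all files.
    my %seen;
    for my $file (@files) {
        open my $fh, '<', $file or die "can't open $file: $!";
        while (my $line = <$fh>) {
            my ($id) = split ' ', $line;
            $seen{$id} = 1 if defined $id;
        }
    }

    # Pass 2: re-read each file into its own column.
    my %cell;    # $cell{$id}[$col] = value
    my $col = 0;
    for my $file (@files) {
        open my $fh, '<', $file or die "can't open $file: $!";
        while (my $line = <$fh>) {
            my ($id, $value) = split ' ', $line;
            $cell{$id}[$col] = $value if defined $id;
        }
        $col++;
    }

    # Emit one row per ID, padding cells no file supplied.
    my @rows;
    for my $id (sort keys %seen) {
        my @row = map { defined $_ ? $_ : 'n/a' }
                  @{ $cell{$id} }[0 .. $#files];
        push @rows, join("\t", $id, @row);
    }
    return @rows;
}
```

Two passes mean reading each file twice, which is no burden at a hundred files; note that here the "columns" are the files themselves, so pass 1 really collects the row IDs that need padding.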


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

ela

I have a hundred files with predefined columns but an unknown number of
rows:

For example, a row from file1 looks like this (delimited by \t):

chr19 56837617 56841944
SIGLEC14(NM_001098612,expression:8.88665993183852) 0 - 56837617
56841944 255,12,12 1 4327 0

What I have to do is build a big table containing this information:

Name\tExpression_file1\tExpression_file2\t ...
Expression_file99\tExpression_file100\n
NM_001098612\t8.88665993183852\t ...\n
....

While using a regular expression to extract the information is easy, and
building the table with an associative array is also simple, the main problem
is padding the previous columns that have no records. While the current
column can be safely padded, the previous columns require more and more
lookups. Recursion seemed to be the solution after my attempts with
for-loops, but the recursion routine turned out to be more difficult than I
previously thought...
 

Xho Jingleheimerschmidt

ela said:
I have a hundred files with predefined columns but an unknown number of
rows:

for example, a row from file1 is like this (delimited by \t)

chr19 56837617 56841944
SIGLEC14(NM_001098612,expression:8.88665993183852) 0 - 56837617
56841944 255,12,12 1 4327 0

Word wrap makes this rather difficult to read.
What I have to do is build a big table containing this information:

Name\tExpression_file1\tExpression_file2\t ...
Expression_file99\tExpression_file100\n
NM_001098612\t8.88665993183852\t ...\n
....

While using a regular expression to extract the information is easy, and
building the table with an associative array is also simple, the main problem
is padding the previous columns that have no records.

Why is that a problem? If all the files are passed in on @ARGV, and
each file of input is turned into a new column in the output, then you
already know what all the columns in the output are going to be, right
up front.
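
A sketch of that one-pass approach (the whitespace-delimited "ID value" format and the sub name are assumptions for illustration): the file list given up front is the column list, so no back-filling of earlier columns is ever needed.

```perl
use strict;
use warnings;

# One pass suffices: the file list IS the column list, so every output
# column is known before any data is read.
# Assumes hypothetical "ID value" whitespace-delimited lines.
sub build_table {
    my @columns = @_;    # filenames, in output column order
    my %value;           # $value{$id}{$file} = value

    for my $file (@columns) {
        open my $fh, '<', $file or die "can't open $file: $!";
        while (my $line = <$fh>) {
            my ($id, $val) = split ' ', $line;
            $value{$id}{$file} = $val if defined $id;
        }
    }

    # Pad every row against the known column list.
    my @rows;
    for my $id (sort keys %value) {
        push @rows, join "\t", $id,
            map { defined $value{$id}{$_} ? $value{$id}{$_} : 'n/a' } @columns;
    }
    return @rows;
}
```

Called as `build_table(@ARGV)`, this emits one padded row per ID with `n/a` wherever a file had no record for that ID.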

While the current
column can be safely padded, the previous columns require more and more
lookups. Recursion seems to be the solution after my attempts with
for-loops, but the recursion routine appears more difficult than I
previously thought...

I don't understand how recursion could plausibly be useful here.

Anyway, what I often find myself doing is using two hashes.

my %exp;
my %sample;
while (<>) {
    my ($refseq, $expression, $sample) = parse_however($_);
    $exp{$refseq}{$sample} = $expression;
    $sample{$sample} = ();    # value unused; the key records the sample
}

Now %sample contains an entry for every sample/tissue/file which has at
least one second-level entry in %exp.

Of course you could have reversed the nesting order of the keys,
$exp{$sample}{$refseq}, but I assume that would be inconvenient for
other reasons, or you would have done it already.
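
For illustration, here is a runnable version of the two-hash idea. The `parse_however` below is a stand-in that assumes whitespace-delimited "refseq expression sample" lines, which is not ela's real format:

```perl
use strict;
use warnings;

# Stand-in parser: assumes "refseq expression sample", whitespace-delimited.
sub parse_however {
    return split ' ', $_[0];
}

# Build %exp and %sample from input lines, then emit one padded row per refseq.
sub pad_rows {
    my @lines = @_;
    my (%exp, %sample);
    for my $line (@lines) {
        my ($refseq, $expression, $sample) = parse_however($line);
        $exp{$refseq}{$sample} = $expression;
        $sample{$sample} = ();    # the key is the point; the value is unused
    }

    # %sample now names every column, so every row can be padded to full width.
    my @out;
    for my $refseq (sort keys %exp) {
        push @out, join "\t", $refseq,
            map { defined $exp{$refseq}{$_} ? $exp{$refseq}{$_} : 'n/a' }
            sort keys %sample;
    }
    return @out;
}
```

The point is that `exists $sample{$file}` is true even though the stored value is undef, so `sort keys %sample` enumerates every column that must appear in the output.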


Xho
 

Xho Jingleheimerschmidt

ela said:
Sorry, but would you mind elaborating on why the second hash serves the
purpose? I can't quite follow it...

Sorry, but it seems self-evident to me, so I don't see how I can explain
it. Maybe I'm not correctly apprehending what the purpose is that you
have in mind.

Xho
 

ela

Anyway, what I often find myself doing is using two hashes.
my %exp;
my %sample;
while (<>) {
    my ($refseq, $expression, $sample) = parse_however($_);
    $exp{$refseq}{$sample} = $expression;
    $sample{$sample} = ();
}

Now %sample contains an entry for every sample/tissue/file which has at
least one second-level entry in %exp.

Sorry, but would you mind elaborating on why the second hash serves the
purpose? I can't quite follow it...
 

Xho Jingleheimerschmidt

ela said:
Let me give a simple example:

File 1
ID character
1 A
2 T
3 G

File 2
ID character
1 A
3 T

File 3
ID character
2 A
3 T
4 C

processed result

ID    File1_character    File2_character    File3_character
1     A                  A                  n/a
2     T                  n/a                A
3     G                  T                  T
4     n/a                n/a                C

OK, so you have two hashes, one of them multi-level.

The multilevel one is $base{$id}{$file}=$nucleotide, so it contains all
of the data.

The other one is just $file{$file}=(), so it tells you every file, i.e.
every column that needs to exist in the output, so that you can reserve
space for them all, even if a given $id doesn't have data for a given $file.

Once you are done reading all the files, you'd probably want something like:

my @file=sort keys %file; # might want a non-default sort method.

to put that data into a more convenient format for using.

Then:
foreach my $id (keys %base) {
    my @output = @{ $base{$id} }{@file};    # hash slice: one cell per file
    defined $_ or $_ = 'n/a' foreach @output;
    print join("\t", $id, @output), "\n";
}


I've changed the names of the hashes from the previous post, because you
changed the nature of the data contained in them from your previous example.
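
Assembled into one runnable piece (using the File 1/2/3 example above, with in-memory strings standing in for real files and hypothetical column names), the whole recipe looks like:

```perl
use strict;
use warnings;

# The three example files as in-memory strings (stand-ins for real files).
my @inputs = ("1 A\n2 T\n3 G\n", "1 A\n3 T\n", "2 A\n3 T\n4 C\n");

my (%base, %file);
my $n = 0;
for my $content (@inputs) {
    my $name = 'File' . ++$n;       # hypothetical column name
    open my $fh, '<', \$content or die "can't open in-memory file: $!";
    while (my $line = <$fh>) {
        my ($id, $nt) = split ' ', $line;
        $base{$id}{$name} = $nt;
        $file{$name} = ();          # record the column
    }
}

my @file = sort keys %file;         # column order for the output
my @rows;
for my $id (sort keys %base) {
    my @output = @{ $base{$id} }{@file};    # hash slice, one cell per column
    defined $_ or $_ = 'n/a' foreach @output;
    push @rows, join("\t", $id, @output);
}
print "$_\n" for @rows;
```

Running this prints the padded table from the example, with n/a wherever a file lacked a record for that ID.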

Xho
 

Xho Jingleheimerschmidt

Tad said:
Please put the subject of your article in the Subject of your article.

If he attempted to do that, he would be accused of committing the XYZ
problem.

Xho
 

ela

Sorry, but it seems self-evident to me, so I don't see how I can explain
it. Maybe I'm not correctly apprehending what the purpose is that you
have in mind.

Xho

Let me give a simple example:

File 1
ID character
1 A
2 T
3 G

File 2
ID character
1 A
3 T

File 3
ID character
2 A
3 T
4 C

processed result

ID    File1_character    File2_character    File3_character
1     A                  A                  n/a
2     T                  n/a                A
3     G                  T                  T
4     n/a                n/a                C
 

sln

Let me give a simple example:

File 1
ID character
1 A
2 T
3 G

File 2
ID character
1 A
3 T

File 3
ID character
2 A
3 T
4 C

processed result

ID    File1_character    File2_character    File3_character
1     A                  A                  n/a
2     T                  n/a                A
3     G                  T                  T
4     n/a                n/a                C

This can be done in more than one way:

$data[ $id ][ file number ]  -- id as array index; depends on id being a small integer
$data{ $id }[ file number ]  -- id as hash key; the keys are not sorted

-sln
--------------
use strict;
use warnings;

my $f1 =<<EO1;
1 A
2 T
3 G
EO1
my $f2 =<<EO2;
1 A
3 T
EO2
my $f3 =<<EO3;
2 A
3 T
4 C
EO3

my %data;
my $filecount = 0;

# Put file list in the column order
# of the output wanted
# --------------------------
for my $file (\$f1, \$f2, \$f3)
{
    open my $fh, '<', $file or die "can't open file: $!";

    while ( defined (my $line = <$fh>) )
    {
        my ($id, $char) = parseline($line);
        next unless defined $id;
        $data{ $id }[ $filecount + 1 ] = $char;
        unless (defined $data{ $id }[ 0 ]) {
            $data{ $id }[ 0 ] = $id;
        }
    }
    close $fh;
    ++$filecount;
}

for my $id (sort keys %data)
{
    for my $count (0 .. $filecount) {
        $data{ $id }[ $count ] = 'n/a'
            unless defined ($data{ $id }[ $count ]);
    }
    print " '@{$data{ $id }}'\n";
}

sub parseline
{
    return $_[0] =~ /\s*(\S+)\s+(\S+)\s*/;
}
 
