math math said:
I have a file with two tab delimited fields. First field is an ID,
the second field is a large string (up to hundreds of millions of
characters). The file may have many lines.
I would like to sort the file on the first (ID) field and then
merge the second fields (i.e. remove the newlines), so that I get a
single line in which the strings are appended to each other in the
order of their alphabetically sorted IDs.
Is there a way to do that in Perl without reading the whole file into memory?
Provided it is OK to keep all IDs in memory, something like the code
below could be used. The basic idea is to parse the complete input
file once line-by-line and create an array of 'ID records'. Each ID
record is a reference to a two-element array. The first member is the
ID itself, the second is the (stream-) position where the data part of
this line starts (sub get_ids). This array is then sorted by
ID. Afterwards, the code goes through the sorted array, creating the
output lines by starting a new output line whenever a new ID occurs
and concatenating the data parts by seeking to the recorded position
associated with an ID record, reading 'the remainder of the input line' and
printing it (sub generate_output).
NB: For brevity, this omits all error handling. It is assumed that
invalid input lines don't occur. Replacing the read_id_data call with
inline code would be an obvious performance improvement. The tab
character separating the ID from the data part is treated as part of
the ID.
NB^2: A probably much faster way to do this would be to go through the
file line-by-line, use a hash to record 'seen' IDs, create a per-ID
output file whenever a new ID is encountered, write the data part of
each input line to the output file for its ID, add a trailing \n to
all output files once the input file has been completely processed and
merge the output files together in a final processing step.
------------------------
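#!/usr/bin/perl
#
use strict;
use warnings;

# Record [ID, offset of the data part] for every input line.
sub get_ids
{
    my ($in, $ids) = @_;
    my ($line, $the_id, $pos);

    $pos = tell($in);
    while ($line = <$in>) {
        # The captured ID includes the separating tab.
        $line =~ /^([^\t]+\t)/ and $the_id = $1;
        push(@$ids, [$the_id, $pos + length($the_id)]);
        $pos = tell($in);
    }
}

# Seek to the data part of an ID record and return it without the newline.
sub read_id_data
{
    my ($fh, $id) = @_;
    my $data;

    seek($fh, $id->[1], 0);
    $data = <$fh>;
    # chomp (not chop) so that a final line without a newline keeps
    # its last character.
    chomp($data);
    return $data;
}

# Print one output line per distinct ID, concatenating its data parts.
sub generate_output
{
    my ($fh, $ids) = @_;
    my ($last_id, $n);

    $last_id = $ids->[0][0];
    print($last_id, read_id_data($fh, $ids->[0]));

    while (++$n < @$ids) {
        if ($ids->[$n][0] ne $last_id) {
            # New ID: terminate the previous output line, start the next.
            $last_id = $ids->[$n][0];
            print("\n", $last_id);
        }

        print(read_id_data($fh, $ids->[$n]));
    }

    print("\n");
}

{
    my ($fh, @ids);

    open($fh, '<', $ARGV[0]) or die("open: $ARGV[0]: $!");
    get_ids($fh, \@ids);
    @ids = sort { $a->[0] cmp $b->[0] } @ids;
    generate_output($fh, \@ids);
}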
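
For comparison, the per-ID output file idea from NB^2 could look
roughly like the sketch below. This is untuned and rests on a few
assumptions of mine: the sub name merge_by_id is made up, the per-ID
files live in a temporary directory created with File::Temp, each
per-ID file is reopened in append mode for every input line (slow, but
it avoids running into the open-file limit when there are many IDs),
and the trailing \n is written during the merge step instead of being
appended to the per-ID files beforehand. Input lines are assumed to be
well-formed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: split the input into one temporary file per ID,
# then write one merged line per ID, in sorted ID order, to $out_fh.
sub merge_by_id
{
    my ($in_path, $out_fh) = @_;
    my $dir = tempdir(CLEANUP => 1);    # removed when the program exits
    my %file_for;                       # ID -> path of its per-ID file
    my $n = 0;

    open(my $in, '<', $in_path) or die("open: $in_path: $!");
    while (my $line = <$in>) {
        chomp($line);
        my ($id, $data) = split(/\t/, $line, 2);
        $data //= '';

        # First time this ID is seen: assign it a file name.
        $file_for{$id} //= sprintf('%s/%d', $dir, $n++);

        # Append the data part to the file for this ID.
        open(my $part, '>>', $file_for{$id})
            or die("open: $file_for{$id}: $!");
        print $part $data;
        close($part) or die("close: $file_for{$id}: $!");
    }
    close($in);

    # Final step: copy the per-ID files to the output in sorted ID order,
    # in 1 MiB chunks, so that no complete data string is held in memory.
    for my $id (sort keys %file_for) {
        print $out_fh "$id\t";
        open(my $part, '<', $file_for{$id})
            or die("open: $file_for{$id}: $!");
        my $buf;
        while (read($part, $buf, 1 << 20)) {
            print $out_fh $buf;
        }
        close($part);
        print $out_fh "\n";
    }
}

merge_by_id($ARGV[0], \*STDOUT) if @ARGV;
```

Only the IDs and the temporary file names are kept in memory here; the
input is still read one line at a time, as in the code above.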