question concerning pipes and large strings

math math · Jun 19, 2012

Hi,

I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.

I would like to sort the file on the first (ID) field and after this sorting, merge the second fields (i.e. remove the new lines), so that I get a single line with many hundreds of lines that are in the same order appended toeach other as their alphabetically sorted IDs.

Is there a way to do that in PERL without reading the whole file into memory?

Thanks

math math · Jun 19, 2012

Hi,

I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.

I would like to sort the file on the first (ID) field and after this sorting, merge the second fields (i.e. remove the new lines), so that I get a single line with many hundreds of lines that are in the same order appended to each other as their alphabetically sorted IDs.

Is there a way to do that in PERL without reading the whole file into memory?

Thanks

So my first attempt was using a pipe:

open(my $fhd_pipe, "sort -d -k1,1 $my_input_file |" );
open(my $fhd_out, ">test.txt" ) or die $!;
while(<$fhd_pipe>) {
chomp $sequence;
my (undef, $sequence) = split("\t", $_);
print $fhd_out $sequence;
}
close $fhd_out;
close $fhd_pipe;

But with this approach, I cannot check the return value of the first "open"command, which is quite annoying (what if something goes wrong?)

J. Gleixner · Jun 19, 2012

Hi,

I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.

I would like to sort the file on the first (ID) field and after this sorting, merge the second fields (i.e. remove the new lines), so that I get a single line with many hundreds of lines that are in the same order appended to each other as their alphabetically sorted IDs.

Is there a way to do that in PERL without reading the whole file into memory?

Thanks

What have you tried?

Probably sorting it first would make it much easier:

man sort

These should help with the rest:

perldoc -f open
perldoc -f chomp

Rainer Weikusat · Jun 19, 2012

math math said:
I have a file with two tab delimited fields. First field is an ID,
the second field is a large string (up to hundreds of millions of
characters). The file may have many lines.

I would like to sort the file on the first (ID) field and after this
sorting, merge the second fields (i.e. remove the new lines), so
that I get a single line with many hundreds of lines that are in the
same order appended to each other as their alphabetically sorted
IDs.

Is there a way to do that in PERL without reading the whole file into memory?

Provided it is Ok to keep all IDs in memory, something like the code
below could be used. The basic idea is to parse the complete input
file once line-by-line and create an array of 'ID records'. Each ID
record is a reference to a two-element array. The first member is the
ID itself, the second is the (stream-) position where the data part of
this line start (sub get_ids). This array is then sorted by
ID. Afterwards, the code goes through the sorted array, creating the
output lines by starting a new output line whenever a new ID occurs
and concatenating the data parts by seeking to the recorded position
associated with an ID record, reading 'the remainder of the input line' and
printing it (sub generate output).

NB: For brevity, this omits all error handling. It is assumed that
invalid input lines don't occur. Replacing the read_id_data call with
inline code would be an obvious performance improvement. The tab
character separating the ID from the data part is treated as part of
the ID.

NB^2: A probably much faster way to do this would be to go through the
file line-by-line, use a hash to record 'seen' IDs, create a per-ID
output file whenever a new ID is encountered, write the data part of
each input line to the output file for its ID, add a trailing \n to
all output files the input file was completely processed and merge the
output files together in a final processing step.

------------------------
#!/usr/bin/perl
#

sub get_ids
{
my ($in, $ids) = @_;
my ($line, $the_id, $pos);

$pos = tell($in);
while ($line = <$in>) {
$line =~ /^([^\t]+\t)/ and $the_id = $1;
push(@$ids, [$the_id, $pos + length($the_id)]);

$pos = tell($in);
}
}

sub read_id_data
{
my ($fh, $id) = @_;
my $data;

seek($fh, $id->[1], 0);
$data = <$fh>;
chop($data);

return $data;
}

sub generate_output
{
my ($fh, $ids) = @_;
my ($last_id, $n);

$last_id = $ids->[0][0];
print($last_id, read_id_data($fh, $ids->[0]));

while (++$n < @$ids) {
if ($ids->[$n][0] ne $last_id) {
$last_id = $ids->[$n][0];
print("\n", $last_id);
}

print(read_id_data($fh, $ids->[$n]));
}

print("\n");
}

{
my ($fh, @ids);

open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");

get_ids($fh, \@ids);
@ids = sort { $a->[0] cmp $b->[0] } @ids;
generate_output($fh, \@ids);
}

Rainer Weikusat · Jun 19, 2012

Ben Morrow said:
Quoth math math said:

I have a file with two tab delimited fields. First field is an ID, the
second field is a large string (up to hundreds of millions of
characters). The file may have many lines.

Click to expand...

[...]

I would like to sort the file on the first (ID) field and after this
sorting, merge the second fields (i.e. remove the new lines), so that
I get a single line with many hundreds of lines that are in the same
order appended to each other as their alphabetically sorted IDs.

Click to expand...

[...]

I would do this in two passes. Start by reading through the file a block
at a time, and finding all the ID fields. (I am assuming these are small
enough that you aren't worried about keeping the whole list in
memory).

If the file is large and entries with identical ID are not somehow
clustered, this will result in a lot of (useless) I/O.

For each ID, remember the start and finish offset of the second field
(using tell, or by keeping a running count of where you are in the file,
whichever's easier). Put these in a hash keyed by ID.

Then, you can sort the list of IDs, and, for each ID, seek to the right
place in the old file and pull out the data item for the new file
directly.

Minus some simplifications, this is a much better idea than building
an ID array in the way I suggested (all caveats still apply).

-----------------------
#!/usr/bin/perl
#

sub get_ids
{
my ($in, $ids) = @_;
my ($line, $the_id, $pos);

$pos = tell($in);
while ($line = <$in>) {
$line =~ /^([^\t]+\t)/ and $the_id = $1;
push(@{$ids->{$the_id}}, $pos + length($the_id));

$pos = tell($in);
}
}

sub read_data_at
{
my ($fh, $pos) = @_;
my $data;

seek($fh, $pos, 0);
$data = <$fh>;
chop($data);

return $data;
}

{
my ($fh, %ids);

open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");

get_ids($fh, \%ids);

for (sort(keys(%ids))) {
print($_);

print(read_data_at($fh, $_)) for @{$ids{$_}};
print("\n");
}
}

Xho Jingleheimerschmidt · Jun 20, 2012

So my first attempt was using a pipe:

open(my $fhd_pipe, "sort -d -k1,1 $my_input_file |" );
open(my $fhd_out, ">test.txt" ) or die $!;
while(<$fhd_pipe>) {
chomp $sequence;
my (undef, $sequence) = split("\t", $_);
print $fhd_out $sequence;
}
close $fhd_out;
close $fhd_pipe;

But with this approach, I cannot check the return value of the first "open" command

You can (and should) but it will only tell you if the shell failed to
start, not if the sort failed to find the named input file.

, which is quite annoying (what if something goes wrong?)

You can still check the return value of the "close $fhd_pipe", which
should fail if an error in the sort occurred. Also, either the sort or
the shell should send a message to stderr. Whether you are likely to
notice that depends on unspecified details.

Xho

Rainer Weikusat · Jun 20, 2012

Ben Morrow said:
Quoth Rainer Weikusat said:

Ben Morrow said:

Quoth math math <[email protected]>:
I have a file with two tab delimited fields. First field is an ID, the
second field is a large string (up to hundreds of millions of
characters). The file may have many lines.
[...]

I would like to sort the file on the first (ID) field and after this
sorting, merge the second fields (i.e. remove the new lines), so that
I get a single line with many hundreds of lines that are in the same
order appended to each other as their alphabetically sorted IDs.
[...]

I would do this in two passes. Start by reading through the file a block
at a time, and finding all the ID fields. (I am assuming these are small
enough that you aren't worried about keeping the whole list in
memory).

Click to expand...

If the file is large and entries with identical ID are not somehow
clustered, this will result in a lot of (useless) I/O.

Click to expand...

How so?

Because the complete contents of the file won't fit into the kernel
page cache and this means whenever a block of data needs to be read
which presently isn't in the page cache, one of the existing blocks
needs to be evicted and the data read 'from disk'. This can happen
numerous times when randomly seeking back and forth within the file.

The same phenomenon will occur at the perl buffering level except that
it will likely be much more severe (in terms of system-call overhead)
because the perl-level buffer will likely (this may be wrong) be much
smaller than the kernel page cache.

[...]

You're still reading an entire line into memory.

Can you imagine that I know that? Actually, that just about everyone
reading your text will know that?

If you want an opinion on this: If a single line of input is too large
to be kept in memory, Perl is decidedly the wrong choice for solving
this problem.

$line =~ /^([^\t]+\t)/ and $the_id = $1;
push(@{$ids->{$the_id}}, $pos + length($the_id));

Click to expand...

Ah, you are assuming IDs might be duplicated. I was assuming they were
unique, and just needed sorting. The OP will have to clarify this.

The idea I got from the text of the OP was that he wanted to turn
multiple entries for a given ID into a single line while continuing to
have multiple entries for different IDs. After rereading his text, I
think that was probably wrong. An simple implementation of the
'concatenate everything in ID-sorted order' with a sensible I/O
stragey:

----------------
#!/usr/bin/perl
#

use Errno qw(EMFILE ENFILE);

{
my ($in, $open, $id, %ids, @open, $input);

while ($input = <STDIN>) {
($id) = $input =~ /^([^\t]+)\t/;
chop($input);

until (defined(open($out, '+>', $id))) {
die("open: $in: $!")
unless ($! == EMFILE || $! == ENFILE) && @open;

$ids{pop(@open)} = undef;
}

print $out (substr($input, length($id) + 1));

$ids{$id} = $out;
push(@open, $id);
$out = undef;
}

for (sort(keys(%ids))) {
$in = $ids{$_};

if ($in) {
seek($in, 0, 0);
} else {
until (defined(open($in, '<', $_))) {
die("open/2: $in: $!")
unless $! == EMFILE || $! == ENFILE;

1 while $open = pop(@open) and !$ids{$open};
$open || die("noting to close");

$ids{$open} = undef;
}
}

print(<$in>);
$in = $ids{$_} = undef;
unlink($_);
}

print("\n");
}

math math · Jun 20, 2012

How large is the file altogether? A few hundred megabytes, or
substantially larger? It may well not be worth trying to avoid doing
this in memory.

I don't entirely understand what you mean here. If you began with a file
that looked like

002 TWO
001 ONE
003 THREE

what would you expect to end up with? Perhaps you want

ONE TWO THREE

That's correct.

with the IDs removed?

I see other people have recommended sort(1); I would *not* recommend
that, in this case. sort(1) will deal with large files by spilling out
to temporary files on disk, but there's no need for that here.

Would sort(1) still create large files if the sort field is only on the ID field (i.e. sort -k1,1)?

The ID fields are very short, up to 10 chars.

math math · Jun 20, 2012

Ben Morrow said:
Ben Morrow said:

Quoth Rainer Weikusat said:

Quoth math math <[email protected]>:
I have a file with two tab delimited fields. First field is an ID, the
second field is a large string (up to hundreds of millions of
characters). The file may have many lines.

[...]

I would like to sort the file on the first (ID) field and after this
sorting, merge the second fields (i.e. remove the new lines), so that
I get a single line with many hundreds of lines that are in the same
order appended to each other as their alphabetically sorted IDs.

[...]

I would do this in two passes. Start by reading through the file a block
at a time, and finding all the ID fields. (I am assuming these are small
enough that you aren't worried about keeping the whole list in
memory).

If the file is large and entries with identical ID are not somehow
clustered, this will result in a lot of (useless) I/O.

Click to expand...

How so?

Click to expand...

Because the complete contents of the file won't fit into the kernel
page cache and this means whenever a block of data needs to be read
which presently isn't in the page cache, one of the existing blocks
needs to be evicted and the data read 'from disk'. This can happen
numerous times when randomly seeking back and forth within the file.

The same phenomenon will occur at the perl buffering level except that
it will likely be much more severe (in terms of system-call overhead)
because the perl-level buffer will likely (this may be wrong) be much
smaller than the kernel page cache.

[...]

You're still reading an entire line into memory.

Click to expand...

Can you imagine that I know that? Actually, that just about everyone
reading your text will know that?

If you want an opinion on this: If a single line of input is too large
to be kept in memory, Perl is decidedly the wrong choice for solving
this problem.

$line =~ /^([^\t]+\t)/ and $the_id = $1;
push(@{$ids->{$the_id}}, $pos + length($the_id));

Click to expand...

Ah, you are assuming IDs might be duplicated. I was assuming they were
unique, and just needed sorting. The OP will have to clarify this.

Click to expand...

The idea I got from the text of the OP was that he wanted to turn
multiple entries for a given ID into a single line while continuing to
have multiple entries for different IDs. After rereading his text, I
think that was probably wrong. An simple implementation of the
'concatenate everything in ID-sorted order' with a sensible I/O
stragey:

Hm, I don't understand right away what's really going on below in the script, I will try to decipher it.

----------------
#!/usr/bin/perl
#

use Errno qw(EMFILE ENFILE);

{
my ($in, $open, $id, %ids, @open, $input);

while ($input = <STDIN>) {
($id) = $input =~ /^([^\t]+)\t/;
chop($input);

until (defined(open($out, '+>', $id))) {
die("open: $in: $!")
unless ($! == EMFILE || $! == ENFILE) && @open;

$ids{pop(@open)} = undef;
}

print $out (substr($input, length($id) + 1));

$ids{$id} = $out;
push(@open, $id);
$out = undef;
}

for (sort(keys(%ids))) {
$in = $ids{$_};

if ($in) {
seek($in, 0, 0);
} else {
until (defined(open($in, '<', $_))) {
die("open/2: $in: $!")
unless $! == EMFILE || $! == ENFILE;

1 while $open = pop(@open) and !$ids{$open};
$open || die("noting to close");

$ids{$open} = undef;
}
}

print(<$in>);
$in = $ids{$_} = undef;
unlink($_);
}

print("\n");
}

math math · Jun 20, 2012

What have you tried?

Sorry, I should have pasted my trial code first, I did it right after my initial post, but it got here a bit late. Please see my second post for details (there is an ordering mistake in the snipper, "chomp" should be below the line with the "split")

Probably sorting it first would make it much easier:

man sort

Indeed, I tried sort first, it works, it is more of a scalability question really.

These should help with the rest:

perldoc -f open
perldoc -f chomp

Thanks.

Rainer Weikusat · Jun 20, 2012

[...]

Hm, I don't understand right away what's really going on below in the script, I will try to decipher it.

That's what I think to be the best idea here (this may be wrong): It
splits the input file into files each containing the data part of a
single line while recording the IDs. Afterwards, it sorts the IDs and
concatenates the temporary files into an output file in that order.
There's some additional logic in order to keep as many of the
temporary files open until the output stage as the OS supports.

Rainer Weikusat · Jun 20, 2012

math math said:
[...]

Probably sorting it first would make it much easier:

man sort

Click to expand...

Indeed, I tried sort first, it works, it is more of a scalability
question really.

This is a really bad idea because sort will reorder the complete input
lines, including the data part, possible/ probably multiple times for
each input line, and this means a lot of copying of data which doesn't
need to be copied since only the IDs are supposed to be sorted.

Rainer Weikusat · Jun 20, 2012

Ben Morrow said:
The part you quoted reads through the entire file exactly once, in
order. There is no seeking.I agree there will be a lot of seeking
later,

Could you perhaps enlighthen me what the point of this remark is
supposed to be? Since I wrote about seeking and seeking is a necessary
part of the complete solution, it should be blatantly obivious that I
was referring to that.

[...]

That depends on the filesystem and OS, of course, but yes, in general I
would assume this would be the case. I'm not sure it matters: in a
process as IO-bound as this, the system call overhead is likely to be
irrelevant.

Let's assume for the sake of example that 'all of the I/O' takes 30
days. The kilotons of useless system calls take an hour. This is about
0.14% of the first time span. Would you want to wait an additional
hour for this task to complete just because it is only 0.14%? Or would
you rather want to avoid this hour as well? Can you imagine that
someone who just wants to use the code could want to avoid this hour?
Or that someone could want to do something more useful with this hour
of CPU time than 'burn it with useless system calls'?

I strongly disagree. IMHO a file with lines that long simply isn't
meaningfully a 'text file' any more, and so needs to be handled like a
binary file: read in blocks, and remember byte positions. Perl is
perfectly capable of handling binary data.

As opposed to something sensible such as memory-mapping the input
file (or that part of it which fits into the process address space),
the overhead is going to be grotesque and when 'low-level buffer
control' has to be done in any case, this overhead is not justified.

[...]

----------------
#!/usr/bin/perl
#

use Errno qw(EMFILE ENFILE);

{
my ($in, $open, $id, %ids, @open, $input);

while ($input = <STDIN>) {
($id) = $input =~ /^([^\t]+)\t/;
chop($input);

until (defined(open($out, '+>', $id))) {
die("open: $in: $!")
unless ($! == EMFILE || $! == ENFILE) && @open;

Click to expand...

Copying the data out to temporary files and then reading it back in
again is *bound* to be more IO than seeking in the original file.

No, it is not bound to be more I/O (neither physical nor 'system call
I/O') because everything is done strictly sequential and will thus
interact nicely with any intermediate buffering layers: There's never
a need to 'flush the buffer, read a different block into it, flush the
buffer again, reread the original block'. If this helps with the issue
at hand would be a different question: It certainly will if the buffer
is larger than any indivividual data item. This might not be the case
here and consequently, both approaches should be tested to determine
which one is better.

Rainer Weikusat · Jun 20, 2012

[...]

As opposed to something sensible such as memory-mapping the input
file (or that part of it which fits into the process address space),
the overhead is going to be grotesque

For sake of completeness and because I felt like doing it: Here's a
seek-based variant based on reading data in 'blocks' (of an arbitrary
size > 0) which actually works (according to my limited testing, still
without error handling).

----------------------------
#!/usr/bin/perl
#

use constant BLOCK => 4096;

sub read_block
{
my ($block, $rc);

$rc = sysread($_[0], $block, $_[1] // BLOCK);
$rc // die("sysread: $!");
$_[1] && $rc != $_[1] && die("short read");

return $rc ? $block : undef;
}

sub get_ids
{
my ($in, $ids) = @_;
my ($block, $id, $want, $bpos, $sbpos, $fpos);

$want = "\t";
while ($block = read_block($in)) {
$sbpos = $bpos = 0;

{
$bpos = index($block, $want, $sbpos);

if ($want eq "\t") {
if ($bpos != -1) {
$id .= substr($block, $sbpos, $bpos - $sbpos);
push(@$ids, [$id, $fpos + ++$bpos]);
$id = '';

$want = "\n";
$sbpos = $bpos;
redo if $sbpos < length($block);
} else {
$id .= substr($block, $sbpos);
}

last;
}

if ($want eq "\n") {
last if $bpos == -1;

push(@{$ids->[$#$ids]}, $fpos + $bpos);

$want = "\t";
$sbpos = $bpos + 1;
redo if $sbpos < length($block);
}
}

$fpos += length($block);
}
}

sub print_id_data
{
my ($fh, $id) = @_;
my ($blocks, $len);

seek($fh, $id->[1], 0);
$len = $id->[2] - $id->[1];

$blocks = int($len / BLOCK);
print(read_block($fh, BLOCK)) while ($blocks--);

$len %= BLOCK;
print(read_block($fh, $len)) if $len;
}

{
my ($fh, @ids);

open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");

get_ids($fh, \@ids);
print_id_data($fh, $_) for sort { $a->[0] cmp $b->[0] } @ids;
}

Rainer Weikusat · Jun 20, 2012

Ben Morrow said:
'IO-bound' means that for most of its runtime, the process will be
sitting there doing exactly nothing, waiting for IO to complete.

Really? Who would have guessed that ...

Rainer Weikusat · Jun 20, 2012

[...]

seek($fh, $id->[1], 0);

This should be sysseek (works here because no buffered input is ever
done on this filehandle).

Rainer Weikusat · Jun 21, 2012

[...]

----------------
#!/usr/bin/perl
#

use Errno qw(EMFILE ENFILE);

{
my ($in, $open, $id, %ids, @open, $input);

while ($input = <STDIN>) {
($id) = $input =~ /^([^\t]+)\t/;
chop($input);

until (defined(open($out, '+>', $id))) {
die("open: $in: $!")
unless ($! == EMFILE || $! == ENFILE) && @open;

Click to expand...

Copying the data out to temporary files and then reading it back in
again is *bound* to be more IO than seeking in the original file.

Click to expand...

For a small test I made, this approach is indeed hopeless but not
because of the additional data I/O but because of the overhead of
creating, deleting, opening etc 'a really large number of files'. But
my 'data parts' where tiny compared to the 'many millions' of the OP
and things might still be different for this case.

Martijn Lievaart · Jun 21, 2012

math math said:
math math said:

On 06/19/12 10:50, math math wrote:
Hi,

I have a file with two tab delimited fields. First field is an ID,
the second field is a large string (up to hundreds of millions of
characters). The file may have many lines.

I would like to sort the file on the first (ID) field and after this
sorting, merge the second fields (i.e. remove the new lines),
so that I get a single line with many hundreds of lines that are in
the same order appended to each other as their alphabetically sorted
IDs.

Click to expand...

[...]

Probably sorting it first would make it much easier:

man sort

Click to expand...

Indeed, I tried sort first, it works, it is more of a scalability
question really.

Click to expand...

This is a really bad idea because sort will reorder the complete input
lines, including the data part, possible/ probably multiple times for
each input line, and this means a lot of copying of data which doesn't
need to be copied since only the IDs are supposed to be sorted.

As GNU sort is rather optimized, I would benchmark this before making
blanket statements like this.

Also, we don't know if efficiency is relevant. If it runs only once a
month, at night, the OP probably does not care if it takes a few hours as
opposed to a few minutes.

That said, the requirements are rather unique, so there is also a good
chance that sort will handle these files abysmally bad, chewing up all
memory and disk-I/O and effectively bringing the machine to it's knees.

So to the OP: Does sort -k1,1 run in acceptable time for you? If so,
there is the first part of your answer, the second part is now rather
trivial (or if you still run out of memory, more trivial than the
original problem).

(And if you do run out of memory, ask yourself, do I need a "works
always" solution, or does just adding more memory solve your immediate
problem)

HTH,
M4

Rainer Weikusat · Jun 21, 2012

Martijn Lievaart said:
]

This is a really bad idea because sort will reorder the complete input
lines, including the data part, possible/ probably multiple times for
each input line, and this means a lot of copying of data which doesn't
need to be copied since only the IDs are supposed to be sorted.

Click to expand...

As GNU sort is rather optimized, I would benchmark this before making
blanket statements like this.

'Rather optimmized' usually means the code is seriously convoluted
because it used to run faster on some piece of ancient hardware in
1997 for a single test case because of that. And not matter how
'optimized', a sort program needs to sort its input. Which involves
reordering it. Completely. In case of files which are too large for
the memory of a modern computer, this involves a real lot of copying
data around.

I suggest that you make some benchmarks before making blanket
statements like the one above.

Also, we don't know if efficiency is relevant. If it runs only once a
month, at night, the OP probably does not care if it takes a few hours as
opposed to a few minutes.

Efficiency is always relevant except in a single case: The guy who has
to write the code is so busy with getting it to work at all that the
mere thought of having to try to make it work sensibly scares the shit
out of him and he tries to pass this competence-deficit as 'secret
advantage' when posing for others. Uusually, this will also always
involve a dedicated computer for testing and often, the people who are
going to use the code are not in the position to complain to the
person who wrote it, IOW, run-time efficiency doesn't matter because
it is someone elses problem.

Congratulate yourself to the happy situation you happen to be in. Stop
assuming that it is 'the universal situation'. Things might look
rather different if code is written for in-house use and supposed to
run on a computer which also provides VPN services for customers
coming from fifty different companies.

Rainer Weikusat · Jun 21, 2012

Rainer Weikusat said:
Martijn Lievaart said:

]

man sort

Indeed, I tried sort first, it works, it is more of a scalability
question really.

This is a really bad idea because sort will reorder the complete input
lines, including the data part, possible/ probably multiple times for
each input line, and this means a lot of copying of data which doesn't
need to be copied since only the IDs are supposed to be sorted.

Click to expand...

As GNU sort is rather optimized, I would benchmark this before making
blanket statements like this.

Click to expand...

'Rather optimmized' usually means the code is seriously convoluted
because it used to run faster on some piece of ancient hardware in
1997 for a single test case because of that. And not matter how
'optimized', a sort program needs to sort its input. Which involves
reordering it. Completely. In case of files which are too large for
the memory of a modern computer, this involves a real lot of copying
data around.

I suggest that you make some benchmarks before making blanket
statements like the one above.

On some random computer I just used for that, sorting a 1080M file
(4000000 lines) with sort using the first column as key and sending
output to /dev/null (average from three runs) comes out at 40.9s
wallclock/ 6.4 user/ 3.2sys. Using one of the Perl scripts I posted
(with unused code removed) to extract the ID from each input line and
sort the list of IDs takes 9.5w/ 7.8u/ 0.7s. I didn't check if sort
used temporary files for this but it doesn't really matter because
sort is guaranteed to lose out if the file is only large enough. In
this case, the volume of data it needed to deal with was 1,132,259,700
bytes while the Perl script only needed to sort 28,000,000 bytes. And
the average line length was only 283 bytes in this case, not the 'many
millions' of the original problem.

Find and count strings of text from multiple files	17	Dec 16, 2021
How to Make CSV Contact Files Work Seamlessly Across All Smartphones?	0	Sep 17, 2025
Date in CSV/TSV question	19	Jan 1, 2013
logic question for text file updates	11	Mar 26, 2010
parsing large (>1gb) strings..how?	2	Jul 31, 2003
Aggregating/Sorting Large files	6	May 25, 2006
Searching way to assign blank delimited strings to different variables?	1	Jan 18, 2011
Efficient processing of large nuumeric data file	12	Jan 18, 2008

question concerning pipes and large strings

math math

math math

J. Gleixner

Rainer Weikusat

Rainer Weikusat

Xho Jingleheimerschmidt

Rainer Weikusat

math math

math math

math math

Rainer Weikusat

Rainer Weikusat

Rainer Weikusat

Rainer Weikusat

Rainer Weikusat

Rainer Weikusat

Rainer Weikusat

Martijn Lievaart

Rainer Weikusat

Rainer Weikusat

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads