nice parallel file reading

George Mpouras · Apr 26, 2013

# Read files in parallel. FileHandles are closed automatically.
# Files are read circularly, one line per call. Hope you like it!

use strict;
use warnings;

my $Read_line = Read_files_round_robin( 'file1.txt', 'file2.txt',
                                        'file3.txt' );

while ( my $line = $Read_line->() ) {
    last if $line eq '__ALL_FILES_HAVE_BEEN_READ__';
    chomp $line;
    print "$line\n";
}

sub Read_files_round_robin
{
    my $fc = $#_;
    my @FH;
    for (my $i=0; $i<@_; $i++) {
        open $FH[$#_ - $i], $_[$i]
            or die "Could not read file \"$_[$i]\" because \"$^E\"\n"
    }

    sub
    {
        local $_ = '__ALL_FILES_HAVE_BEEN_READ__';

        for (my $i=$fc; $i>=0; $i--)
        {
            if ( eof $FH[$i] )
            {
                close $FH[$i];
                splice @FH, $i, 1;
                next
            }

            $_ = readline $FH[$i];
            last
        }

        $fc = $fc == 0 ? $#FH : $fc - 1;
        $_
    }
}

George Mpouras · Apr 27, 2013

# There was a problem with the code in my initial post.
# Here is a corrected version that reads the files round-robin
# using an iterator.

#!/usr/bin/perl
use strict;
use warnings;

my $Reader = Read_files_round_robin( 'file1.txt', 'file2.txt',
                                     'file3.txt' );

while ( my $line = $Reader->() ) {
    last if $line eq '__ALL_FILES_HAVE_BEEN_READ__';
    chomp $line;
    print "*$line*\n";
}

sub Read_files_round_robin
{
    my @FH;
    for (my $i=$#_; $i>=0; $i--) { if (open my $fh, $_[$i]) {push @FH, $fh} }
    my $k = $#FH;

    sub
    {
        until (0 == @FH)
        {
            for (my $i=$k--; $i>=0; $i--)
            {
                $k = $#FH if $k == -1;

                if ( eof $FH[$i] )
                {
                    close $FH[$i];
                    splice @FH, $i, 1;
                    $k--
                }
                else
                {
                    return readline $FH[$i]
                }
            }
        }

        '__ALL_FILES_HAVE_BEEN_READ__'
    }
}

Jürgen Exner · Apr 27, 2013

"George Mpouras"

# there was a problem with the code at my initial post
# Here is corrected, of how to read files like round-robin
# using an iterator

While this might be mildly interesting as an academic exercise I wonder
if there is any actual non-contrived application where you would have to
read multiple files synchronously line-by-line and at the same time the
files are too large to just load them into a variable and then process
their content.

jue

Peter J. Holzer · Apr 27, 2013

Jürgen Exner said:

While this might be mildly interesting as an academic exercise I wonder
if there is any actual non-contrived application where you would have to
read multiple files synchronously line-by-line and at the same time the
files are too large to just load them into a variable and then process
their content.

Not exactly like George's code, but very similar: Merge sorted files.

A similar technique could be used to implement comm(1).

hp

Jürgen Exner · Apr 27, 2013

Peter J. Holzer said:
Not exactly like George's code, but very similar: Merge sorted files.

Fair enough, but for merge sort you explicitly do _NOT_ read files
synchronously.
The only application I could think of is testing for equality of n
files.
Or implementing a poor man's database in multiple files with each column
of a table in a separate file. Which of course would be a synchronization
nightmare.

jue

Rainer Weikusat · Apr 27, 2013

"George Mpouras"

# there was a problem with the code at my initial post
# Here is corrected, of how to read files like round-robin
# using an iterator
[...]

sub Read_files_round_robin
{
    my @FH;
    for (my $i=$#_; $i>=0; $i--) { if (open my $fh, $_[$i]) {push @FH, $fh} }
    my $k = $#FH;

    sub
    {
        until (0 == @FH)
        {
            for (my $i=$k--; $i>=0; $i--)
            {
                $k = $#FH if $k == -1;

                if ( eof $FH[$i] )
                {
                    close $FH[$i];
                    splice @FH, $i, 1;
                    $k--
                }
                else
                {
                    return readline $FH[$i]
                }
            }
        }

        '__ALL_FILES_HAVE_BEEN_READ__'
    }
}

Fun ways to waste your time:

----------------------
#!/usr/bin/perl
use strict;

my $Reader = Read_files_round_robin( 'file1.txt', 'wuzz', 'file2.txt', 'file3.txt');

while ( my $line = $Reader->() ) {
    chomp $line;
    print "*$line*\n";
}

sub Read_files_round_robin
{
    my (@F, $cur);

    open($F[0][@{$F[0]}], '<', $_) // --$#{$F[0]}
        for @_;

    return sub {
        my ($fh, $l);

        do {
            $fh = shift(@{$F[$cur]}) or return
        } until defined($l = <$fh>);

        push(@{$F[$cur ^ 1]}, $fh);
        $cur ^= 1 unless @{$F[$cur]};

        return $l;
    };
}

Rainer Weikusat · Apr 27, 2013

[...]

sub Read_files_round_robin
{
    my (@F, $cur);

    open($F[0][@{$F[0]}], '<', $_) // --$#{$F[0]}
        for @_;

    return sub {
        my ($fh, $l);

        do {
            $fh = shift(@{$F[$cur]}) or return
        } until defined($l = <$fh>);

        push(@{$F[$cur ^ 1]}, $fh);
        $cur ^= 1 unless @{$F[$cur]};

        return $l;
    };
}

While this is fairly neat, it is unfortunately broken: It is possible
that the 'current' array runs out of usable file handles but that a
usable file handle still exists in the 'next' array (eg, when the
first file is the one containing the most lines of text). This means
the 'current' array needs to be switched exactly once in this case
which, in turn, ends up making the control flow rather ugly :-( (I
tried a few variants but didn't find one I would want to post).
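
For what it's worth, here is a minimal sketch of one way around the switching problem: keep a single queue, take the handle at the front, and put it back at the end only if it still produced a line. This is a different technique from the two-array version above, not a repair of it. The name read_files_rotating is made up for illustration, and files that cannot be opened are skipped silently, as in the posted code.

```perl
use strict;
use warnings;

sub read_files_rotating
{
    my @queue;

    # Skip anything that cannot be opened, as the code above does.
    for my $name (@_) {
        open(my $fh, '<', $name) or next;
        push(@queue, $fh);
    }

    return sub {
        while (defined(my $fh = shift(@queue))) {
            my $l = readline($fh);
            if (defined($l)) {
                push(@queue, $fh);    # still has data: back to the end of the queue
                return $l;
            }
            close($fh);               # exhausted: drop it and try the next handle
        }
        return;                       # nothing left to read
    };
}
```

The caller would loop with something like while (defined(my $line = $Reader->())) so that a last line consisting of "0" without a newline does not end the loop early.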

George Mpouras · Apr 28, 2013

push(@{$F[$cur ^ 1]}, $fh);

impressive , I have to study this !!

Rainer Weikusat · Apr 28, 2013

"George Mpouras"

push(@{$F[$cur ^ 1]}, $fh);

impressive ,

Not really. The idea to use two arrays cannot work in this way, as I
already wrote in another posting. But it is still possible to do away
with the counting loops (which are IMHO 'rather ugly', IOW, I never
use for (;;) for anything):

-----------------
sub Read_files_round_robin
{
    my (@FH, $cur);

    open($FH[@FH], '<', $_) // --$#FH
        for @_;

    $cur = -1;

    return sub {
        my $l;

        return unless @FH;

        $cur = ($cur + 1) % @FH;
        $cur == @FH and --$cur
            until ($l = readline($FH[$cur])) // (splice(@FH, $cur, 1), !@FH);

        return $l;
    };
}
------------------

It is possible to replace the

$cur == @FH and --$cur

with

$cur -= $cur == @FH

This would be a good idea in C because it would avoid a branch in favor
of an arithmetic no-op. I don't really know if this is true or false
for Perl and I'm unsure whether one or the other should be preferred
for clarity.

?
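
A quick way to check that the two forms agree: in Perl a comparison yields 1 when true and a special false value that counts as 0 in arithmetic (without an "isn't numeric" warning), so subtracting it is a no-op. The sketch below compares both forms over a few index/length pairs; the helper names clamp_branch and clamp_arith are made up for this comparison.

```perl
use strict;
use warnings;

# Branching form: decrement only when the index has run off the end.
sub clamp_branch { my ($cur, $n) = @_; $cur == $n and --$cur; return $cur }

# Arithmetic form: subtract the result of the comparison (1 or 0).
sub clamp_arith  { my ($cur, $n) = @_; $cur -= $cur == $n; return $cur }

for my $n (1 .. 4) {
    for my $cur (0 .. $n) {
        my ($x, $y) = (clamp_branch($cur, $n), clamp_arith($cur, $n));
        die "mismatch for cur=$cur n=$n\n" unless $x == $y;
    }
}
print "both forms agree\n";
```

Whether either form is measurably faster in Perl could be checked with cmpthese from the core Benchmark module; this sketch only shows that they compute the same index.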

Rainer Weikusat · Apr 28, 2013

Peter J. Holzer said:
Not exactly like George's code, but very similar: Merge sorted files.

A similar technique could be used to implement comm(1).

There's also a paste utility which does round-robin merging of lines
from several input files. This would need a different EOF-handling,
though (it would need to return an empty line every time a file which
ran out of data is supposed to be read from).
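
A rough sketch of what that paste-like EOF handling could look like, assuming the handles are already open and that an exhausted file contributes an empty string for its slot until every file has run out. The name read_round_robin_padded is hypothetical, and lines are returned chomped so the caller can join them with whatever delimiter it likes.

```perl
use strict;
use warnings;

sub read_round_robin_padded
{
    my @fh  = @_;
    my $cur = 0;

    return sub {
        # At the start of each round, stop if no handle has data left.
        if ($cur == 0) {
            return unless grep { !eof($_) } @fh;
        }

        my $l = readline($fh[$cur]);
        $cur = ($cur + 1) % @fh;

        return '' unless defined($l);    # exhausted file: empty slot
        chomp($l);
        return $l;
    };
}
```

Note that eof has to try a read to find out, so on interactive handles or pipes this check can block; for regular files that should not matter.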

Uri Guttman · May 1, 2013

JE> "George Mpouras"

JE> While this might be mildly interesting as an academic exercise I wonder
JE> if there is any actual non-contrived application where you would have to
JE> read multiple files synchronously line-by-line and at the same time the
JE> files are too large to just load them into a variable and then process
JE> their content.

not as true today but merge sorting did this very thing in the olden
days. there are probably some similar problems today.

uri

Jürgen Exner · May 1, 2013

Uri Guttman said:
JE> "George Mpouras"

JE> While this might be mildly interesting as an academic exercise I wonder
JE> if there is any actual non-contrived application where you would have to
JE> read multiple files synchronously line-by-line and at the same time the
JE> files are too large to just load them into a variable and then process
JE> their content.

not as true today but merge sorting did this very thing in the olden
days. there are probably some similar problems today.

As I mentioned in a different message, merge sort does not read
_synchronously_, i.e. round robin, from the files; for each
line/value it depends upon which file currently has the lowest
line/value, and this can very well be the same file again and again for
many lines/values.

jue
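
To make the difference concrete, here is a minimal sketch of a merge iterator over already-sorted inputs. It keeps one pending line per handle and always hands out the smallest one, so the same file may be read many times in a row. The name merge_sorted_iter is made up, the comparison is a plain string comparison on the raw lines, and the linear scan for the minimum is only reasonable for a handful of files.

```perl
use strict;
use warnings;

sub merge_sorted_iter
{
    my @heads;    # each entry: [ pending line, handle ]

    for my $fh (@_) {
        my $l = readline($fh);
        push(@heads, [ $l, $fh ]) if defined($l);
    }

    return sub {
        return unless @heads;

        # Find the entry whose pending line sorts first.
        my $min = 0;
        for my $i (1 .. $#heads) {
            $min = $i if $heads[$i][0] lt $heads[$min][0];
        }

        my ($l, $fh) = @{ $heads[$min] };

        # Refill from the same handle, or drop it once it is exhausted.
        my $next = readline($fh);
        if (defined($next)) { $heads[$min][0] = $next }
        else                { splice(@heads, $min, 1) }

        return $l;
    };
}
```

With inputs "a d e" and "b c f" (one word per line), the second file is read twice in a row for "b" and "c", which is exactly what a round-robin reader would not do.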

Ted Zlatanov · May 6, 2013

JE> While this might be mildly interesting as an academic exercise I wonder
JE> if there is any actual non-contrived application where you would have to
JE> read multiple files synchronously line-by-line and at the same time the
JE> files are too large to just load them into a variable and then process
JE> their content.

I've had to do this. I had multiple log files being simultaneously
processed by log processors and aggregators for real-time monitoring.

(Each log processor was reading simultaneously from multiple files.)

The individual files got into the gigabytes and were frequently
rotated.

This worked pretty well, keeping up with significant amounts of traffic,
and never thrashing or ballooning memory usage. Perl 5.12.

Ted

David Combs · Jun 24, 2013

Coroutines -- would they make this task simpler?

Perl 6 will, I assume, have them; maybe even 5 does?

David

Rainer Weikusat · Jun 24, 2013

Coroutines -- would they make this task simpler?

No. Despite some amount of 'clueless rambling' on Wikipedia for this
topic (it might have been changed in the meantime; I haven't checked it
again since some time before the posting you're replying to was
written), an iterator/generator is not a coroutine but an ordinary
subroutine with a single point of entry and exit; it's just a
stateful subroutine. This is essentially the same as 'an object' (in
the 'OOP' sense) with a single method, and the convenient way to provide
this will usually be 'a closure' (a subroutine which encloses some part
of the lexical environment it was created in).

'Single point of exit' does not refer to stuff like 'multiple return
statements in a subroutine body' but to the location where the
control-flow resumes after the subroutine has finished executing which
is always the statement after the call (Unless the code throws an
exception. This would be the common example of a subroutine with
multiple points of exit). In contrast to this, a coroutine could
yield execution to another, arbitrary coroutine at some 'random' part
of its 'function body' (multiple points of exit) and execution would
resume after the most recently performed yield (multiple points of
entry). This is really the same as 'cooperative [userspace] threading'
(something another moro^Wvery informed person is apt to re-implement
RSN whenever that last offender managed to learn why this isn't a good
idea ...).
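
As a small illustration of the 'stateful subroutine' point, the sketch below writes the same trivial iterator twice: once as a closure and once as an object with a single method. All names (make_counter, Counter, next_value) are invented for the example; in both cases every call enters at the top and returns to the caller, and only the place where the state lives differs.

```perl
use strict;
use warnings;

# Closure form: the state ($n) lives in the enclosed lexical.
sub make_counter
{
    my $n = shift;
    return sub { return $n++ };
}

# Object form: the same state lives in a blessed hash with one method.
package Counter;

sub new        { my ($class, $n) = @_; return bless({ n => $n }, $class) }
sub next_value { my $self = shift; return $self->{n}++ }

package main;

my $it  = make_counter(5);
my $obj = Counter->new(5);

# Both produce the same sequence, one value per call.
print(join(' ', $it->(), $obj->next_value), "\n") for 1 .. 3;
```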
