nice parallel file reading


George Mpouras

# Read files in parallel. Filehandles are closed automatically.
# The files are read circularly, one line per iteration; hope you like it!

use strict;
use warnings;

my $Read_line = Read_files_round_robin( 'file1.txt', 'file2.txt', 'file3.txt' );

while ( my $line = $Read_line->() ) {
    last if $line eq '__ALL_FILES_HAVE_BEEN_READ__';
    chomp $line;
    print "$line\n";
}


sub Read_files_round_robin
{
    my $fc = $#_;
    my @FH;
    for (my $i=0; $i<@_; $i++) {
        open $FH[$#_ - $i], $_[$i] or die "Could not read file \"$_[$i]\" because \"$^E\"\n"
    }

    sub
    {
        local $_ = '__ALL_FILES_HAVE_BEEN_READ__';

        for (my $i=$fc; $i>=0; $i--)
        {
            if ( eof $FH[$i] )
            {
                close $FH[$i];
                splice @FH, $i, 1;
                next
            }

            $_ = readline $FH[$i];
            last
        }

        $fc = $fc == 0 ? $#FH : $fc - 1;
        $_
    }
}
 

George Mpouras

# There was a problem with the code in my initial post.
# Here is the corrected version, showing how to read files round-robin
# using an iterator.


#!/usr/bin/perl
use strict;
use warnings;

my $Reader = Read_files_round_robin( 'file1.txt', 'file2.txt', 'file3.txt' );

while ( my $line = $Reader->() ) {
    last if $line eq '__ALL_FILES_HAVE_BEEN_READ__';
    chomp $line;
    print "*$line*\n";
}




sub Read_files_round_robin
{
    my @FH;
    for (my $i=$#_; $i>=0; $i--) { if (open my $fh, $_[$i]) { push @FH, $fh } }
    my $k = $#FH;

    sub
    {
        until (0 == @FH)
        {
            for (my $i=$k--; $i>=0; $i--)
            {
                $k = $#FH if $k == -1;

                if ( eof $FH[$i] )
                {
                    close $FH[$i];
                    splice @FH, $i, 1;
                    $k--
                }
                else
                {
                    return readline $FH[$i]
                }
            }
        }

        '__ALL_FILES_HAVE_BEEN_READ__'
    }
}
 

Jürgen Exner

"George Mpouras"
# there was a problem with the code at my initial post
# Here is corrected, of how to read files like round-robin
# using an iterator

While this might be mildly interesting as an academic exercise I wonder
if there is any actual non-contrived application where you would have to
read multiple files synchronously line-by-line and at the same time the
files are too large to just load them into a variable and then process
their content.

jue
 

Peter J. Holzer

"George Mpouras"


While this might be mildly interesting as an academic exercise I wonder
if there is any actual non-contrived application where you would have to
read multiple files synchronously line-by-line and at the same time the
files are too large to just load them into a variable and then process
their content.

Not exactly like George's code, but very similar: Merge sorted files.

A similar technique could be used to implement comm(1).

hp
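
For illustration, a minimal two-way merge in that spirit could look like the
following sketch (the file names are made up and both inputs are assumed to
be pre-sorted):

-----------------
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical input names; both files are assumed to be already sorted
open my $fh_a, '<', 'sorted_a.txt' or die "sorted_a.txt: $!\n";
open my $fh_b, '<', 'sorted_b.txt' or die "sorted_b.txt: $!\n";

my $la = <$fh_a>;
my $lb = <$fh_b>;

# always emit the smaller of the two current lines and advance
# only the file it came from
while (defined $la && defined $lb) {
    if ($la le $lb) { print $la; $la = <$fh_a> }
    else            { print $lb; $lb = <$fh_b> }
}

# drain whichever file still has lines left
while (defined $la) { print $la; $la = <$fh_a> }
while (defined $lb) { print $lb; $lb = <$fh_b> }
-----------------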
 
J

Jürgen Exner

Peter J. Holzer said:
Not exactly like George's code, but very similar: Merge sorted files.

Fair enough, but for merge sort you explicitly do _NOT_ read the files
synchronously.
The only application I could think of is testing n files for equality.
Or implementing a poor man's database spread over multiple files, with each
column of a table in a separate file. Which of course would be a
synchronization nightmare.

jue
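
A rough sketch of the n-file equality test mentioned above, reading the
files in lockstep with made-up file names, might look like this:

-----------------
#!/usr/bin/perl
use strict;
use warnings;

# read all files in lockstep, one line from each per round,
# and stop at the first difference
my @FH = map { open my $fh, '<', $_ or die "$_: $!\n"; $fh }
         'copy1.txt', 'copy2.txt', 'copy3.txt';

my $identical = 1;
while (1) {
    my @lines = map { scalar readline $_ } @FH;
    my $eofs  = grep { !defined } @lines;

    last if $eofs == @lines;                      # all files ended together
    if ($eofs or grep { $_ ne $lines[0] } @lines[1 .. $#lines]) {
        $identical = 0;                           # a file ended early
        last;                                     # or a line differs
    }
}
print $identical ? "files are identical\n" : "files differ\n";
-----------------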
 

Rainer Weikusat

"George Mpouras"
# there was a problem with the code at my initial post
# Here is corrected, of how to read files like round-robin
# using an iterator
[...]

sub Read_files_round_robin
{
    my @FH;
    for (my $i=$#_; $i>=0; $i--) { if (open my $fh, $_[$i]) { push @FH, $fh } }
    my $k = $#FH;

    sub
    {
        until (0 == @FH)
        {
            for (my $i=$k--; $i>=0; $i--)
            {
                $k = $#FH if $k == -1;

                if ( eof $FH[$i] )
                {
                    close $FH[$i];
                    splice @FH, $i, 1;
                    $k--
                }
                else
                {
                    return readline $FH[$i]
                }
            }
        }

        '__ALL_FILES_HAVE_BEEN_READ__'
    }
}

Fun ways to waste your time:

----------------------
#!/usr/bin/perl
use strict;

my $Reader = Read_files_round_robin( 'file1.txt', 'wuzz', 'file2.txt', 'file3.txt');

while ( my $line = $Reader->() ) {
    chomp $line;
    print "*$line*\n";
}

sub Read_files_round_robin
{
    my (@F, $cur);

    open($F[0][@{$F[0]}], '<', $_) // --$#{$F[0]}
        for @_;

    return sub {
        my ($fh, $l);

        do {
            $fh = shift(@{$F[$cur]}) or return
        } until defined($l = <$fh>);

        push(@{$F[$cur ^ 1]}, $fh);
        $cur ^= 1 unless @{$F[$cur]};

        return $l;
    };
}
 

Rainer Weikusat

[...]
sub Read_files_round_robin
{
    my (@F, $cur);

    open($F[0][@{$F[0]}], '<', $_) // --$#{$F[0]}
        for @_;

    return sub {
        my ($fh, $l);

        do {
            $fh = shift(@{$F[$cur]}) or return
        } until defined($l = <$fh>);

        push(@{$F[$cur ^ 1]}, $fh);
        $cur ^= 1 unless @{$F[$cur]};

        return $l;
    };
}

While this is fairly neat, it is unfortunately broken: it is possible
that the 'current' array runs out of usable file handles while a
usable file handle still exists in the 'next' array (e.g., when the
first file is the one containing the most lines of text). This means
the 'current' array needs to be switched exactly once in this case,
which, in turn, ends up making the control flow rather ugly :-( (I
tried a few variants but didn't find one I would want to post).
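
One possible (untested) variant along the lines of that description,
switching lists at most once when the current one runs dry, could be
sketched like this:

-----------------
sub Read_files_round_robin
{
    my (@F, $cur);

    # $F[$cur] holds the handles not yet read in the current round,
    # $F[$cur ^ 1] the ones already read once
    @F = ([], []);
    for (@_) {
        open my $fh, '<', $_ or next;
        push @{$F[0]}, $fh;
    }
    $cur = 0;

    return sub {
        my ($fh, $l);

        for (;;) {
            unless (@{$F[$cur]}) {
                # current list ran dry: switch at most once if the
                # other list still has live handles, otherwise stop
                return unless @{$F[$cur ^ 1]};
                $cur ^= 1;
            }
            $fh = shift @{$F[$cur]};
            last if defined($l = <$fh>);
            close $fh;                    # this handle is at EOF
        }

        push @{$F[$cur ^ 1]}, $fh;
        $cur ^= 1 unless @{$F[$cur]};

        return $l;
    };
}
-----------------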
 

Rainer Weikusat

"George Mpouras"
push(@{$F[$cur ^ 1]}, $fh);

impressive ,

Not really. The idea of using two arrays cannot work this way, as I
already wrote in another posting. But it is still possible to do away
with the counting loops (which are IMHO 'rather ugly'; IOW, I never
use for (;;) for anything):

-----------------
sub Read_files_round_robin
{
    my (@FH, $cur);

    open($FH[@FH], '<', $_) // --$#FH
        for @_;

    $cur = -1;

    return sub {
        my $l;

        return unless @FH;

        $cur = ($cur + 1) % @FH;
        $cur == @FH and --$cur
            until ($l = readline($FH[$cur])) // (splice(@FH, $cur, 1), !@FH);

        return $l;
    };
}
------------------

It is possible to replace the

$cur == @FH and --$cur

with

$cur -= $cur == @FH

This would be a good idea in C because it avoids a branch in favor of
an arithmetic no-op. I don't really know whether that holds for Perl,
and I'm unsure which of the two should be preferred for clarity.

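One way to find out would be a micro-benchmark along these lines (a sketch
only; the handle array is faked with plain numbers since only its size
matters here):

-----------------
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# stand-in for the handle array; only its size matters for the comparison
my @FH = (1) x 10;

cmpthese(-1, {
    branch     => sub { my $cur = 10; $cur == @FH and --$cur; $cur },
    arithmetic => sub { my $cur = 10; $cur -= $cur == @FH;    $cur },
});
-----------------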
 

Rainer Weikusat

Peter J. Holzer said:
Not exactly like George's code, but very similar: Merge sorted files.

A similar technique could be used to implement comm(1).

There's also a paste utility which does round-robin merging of lines
from several input files. This would need different EOF handling,
though (it would need to return an empty line every time a file that
has run out of data is supposed to be read from).
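
A rough paste(1)-like sketch with that EOF behaviour, again with made-up
file names, could look like this:

-----------------
#!/usr/bin/perl
use strict;
use warnings;

# each round of output takes one line from every file; a file that
# has run dry contributes an empty field
my @FH = map { open my $fh, '<', $_ or die "$_: $!\n"; $fh }
         'col1.txt', 'col2.txt', 'col3.txt';

while (1) {
    my @fields = map {
        my $l = readline $_;
        defined $l ? do { chomp $l; $l } : undef
    } @FH;

    last unless grep { defined } @fields;   # every file is exhausted
    print join("\t", map { $_ // '' } @fields), "\n";
}
-----------------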
 

Uri Guttman

JE> "George Mpouras"

JE> While this might be mildly interesting as an academic exercise I wonder
JE> if there is any actual non-contrived application where you would have to
JE> read multiple files synchronously line-by-line and at the same time the
JE> files are too large to just load them into a variable and then process
JE> their content.

not as true today but merge sorting did this very thing in the olden
days. there are probably some similar problems today.

uri
 

Jürgen Exner

Uri Guttman said:
JE> "George Mpouras"


JE> While this might be mildly interesting as an academic exercise I wonder
JE> if there is any actual non-contrived application where you would have to
JE> read multiple files synchronously line-by-line and at the same time the
JE> files are too large to just load them into a variable and then process
JE> their content.

not as true today but merge sorting did this very thing in the olden
days. there are probably some similar problems today.

As I mentioned in a different message, merge sort does not read
_synchronously_, i.e. round-robin, from the files; for each line/value,
which file is read next depends on which file currently has the lowest
line/value, and that can very well be the same file again and again for
many lines/values.

jue
 

Ted Zlatanov

JE> While this might be mildly interesting as an academic exercise I wonder
JE> if there is any actual non-contrived application where you would have to
JE> read multiple files synchronously line-by-line and at the same time the
JE> files are too large to just load them into a variable and then process
JE> their content.

I've had to do this. I had multiple log files being simultaneously
processed by log processors and aggregators for real-time monitoring.

(Each log processor was reading simultaneously from multiple files.)

The individual files got into the gigabytes and were frequently
rotated.

This worked pretty well, keeping up with significant amounts of traffic,
and never thrashing or ballooning memory usage. Perl 5.12.

Ted
 

David Combs

Coroutines -- would they make this task simpler?

Perl 6 will, I assume, have them; maybe even 5 does?

David
 

Rainer Weikusat

Coroutines -- would they make this task simpler?

No. Despite some amount of 'clueless rambling' on Wikipedia on this
topic (it might have been changed in the meantime; I didn't check it
again before the posting you're replying to was written), an
iterator/generator is not a coroutine but an ordinary subroutine with
a single point of entry and exit; it is just a stateful subroutine.
This is essentially the same as 'an object' (in the 'OOP' sense) with
a single method, and the convenient way to provide this will usually
be 'a closure' (a subroutine which encloses part of the lexical
environment it was created in).
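
As a minimal illustration of such a stateful subroutine, a hypothetical
counter factory (names made up) could look like this:

-----------------
#!/usr/bin/perl
use strict;
use warnings;

# the returned sub closes over $n and keeps its state between calls --
# a stateful subroutine, not a coroutine
sub make_counter {
    my $n = shift // 0;
    return sub { $n++ };
}

my $next = make_counter(5);
print $next->(), "\n" for 1 .. 3;   # prints 5, 6, 7
-----------------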

'Single point of exit' does not refer to things like 'multiple return
statements in a subroutine body' but to the location where the control
flow resumes after the subroutine has finished executing, which is
always the statement after the call (unless the code throws an
exception; that would be the common example of a subroutine with
multiple points of exit). In contrast to this, a coroutine could yield
execution to another, arbitrary coroutine at some 'random' point of
its 'function body' (multiple points of exit), and execution would
resume after the most recently performed yield (multiple points of
entry). This is really the same as 'cooperative [userspace] threading'
(something another moro^Wvery informed person is apt to re-implement
RSN whenever the last offender has managed to learn why this isn't a
good idea ...).
 
