nice parallel file reading

Discussion in 'Perl Misc' started by George Mpouras, Apr 26, 2013.

  1. # Read files in parallel. Filehandles are closed automatically.
    # Files are read in turn (round-robin), one line per iteration. Hope you like it!

    use strict;
    use warnings;

    my $Read_line = Read_files_round_robin( 'file1.txt', 'file2.txt',
    'file3.txt' );

    while ( my $line = $Read_line->() ) {
    last if $line eq '__ALL_FILES_HAVE_BEEN_READ__';
    chomp $line;
    print "$line\n";
    }


    sub Read_files_round_robin
    {
    my $fc = $#_;
    my @FH;
    for(my $i=0; $i<@_; $i++) { open $FH[$#_ - $i], '<', $_[$i] or die
    "Could not read file \"$_[$i]\" because \"$^E\"\n" }

    sub
    {
    local $_ = '__ALL_FILES_HAVE_BEEN_READ__';

    for (my $i=$fc; $i>=0; $i--)
    {
    if ( eof $FH[$i] )
    {
    close $FH[$i];
    splice @FH, $i, 1;
    next
    }

    $_ = readline $FH[$i];
    last
    }

    $fc = $fc == 0 ? $#FH : $fc - 1;
    $_
    }
    }
    George Mpouras, Apr 26, 2013
    #1

  2. # There was a problem with the code in my initial post.
    # Here is a corrected version showing how to read files round-robin
    # using an iterator.


    #!/usr/bin/perl
    use strict;
    use warnings;

    my $Reader = Read_files_round_robin( 'file1.txt', 'file2.txt',
    'file3.txt' );

    while ( my $line = $Reader->() ) {
    last if $line eq '__ALL_FILES_HAVE_BEEN_READ__';
    chomp $line;
    print "*$line*\n";
    }




    sub Read_files_round_robin
    {
    my @FH;
    for (my $i=$#_; $i>=0; $i--) { if (open my $fh, '<', $_[$i]) {push @FH, $fh} }
    my $k = $#FH;

    sub
    {
    until (0 == @FH)
    {
    for (my $i=$k--; $i>=0; $i--)
    {
    $k = $#FH if $k == -1;

    if ( eof $FH[$i] )
    {
    close $FH[$i];
    splice @FH, $i, 1;
    $k--
    }
    else
    {
    return readline $FH[$i]
    }
    }
    }

    '__ALL_FILES_HAVE_BEEN_READ__'
    }
    }
    George Mpouras, Apr 27, 2013
    #2

  3. "George Mpouras" wrote:
    ># there was a problem with the code at my initial post
    ># Here is corrected, of how to read files like round-robin
    ># using an iterator


    While this might be mildly interesting as an academic exercise I wonder
    if there is any actual non-contrived application where you would have to
    read multiple files synchronously line-by-line and at the same time the
    files are too large to just load them into a variable and then process
    their content.

    jue
    Jürgen Exner, Apr 27, 2013
    #3
  4. On 2013-04-27 14:49, Jürgen Exner wrote:
    > "George Mpouras" wrote:
    >># there was a problem with the code at my initial post
    >># Here is corrected, of how to read files like round-robin
    >># using an iterator

    >
    > While this might be mildly interesting as an academic exercise I wonder
    > if there is any actual non-contrived application where you would have to
    > read multiple files synchronously line-by-line and at the same time the
    > files are too large to just load them into a variable and then process
    > their content.


    Not exactly like George's code, but very similar: Merge sorted files.

    A similar technique could be used to implement comm(1).

    hp


    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
    Peter J. Holzer, Apr 27, 2013
    #4
  5. "Peter J. Holzer" wrote:
    >On 2013-04-27 14:49, Jürgen Exner wrote:
    >> "George Mpouras" wrote:
    >>># there was a problem with the code at my initial post
    >>># Here is corrected, of how to read files like round-robin
    >>># using an iterator

    >>
    >> While this might be mildly interesting as an academic exercise I wonder
    >> if there is any actual non-contrived application where you would have to
    >> read multiple files synchronously line-by-line and at the same time the
    >> files are too large to just load them into a variable and then process
    >> their content.

    >
    >Not exactly like George's code, but very similar: Merge sorted files.


    Fair enough, but for merge sort you explicitly do _NOT_ read files
    synchronously.
    The only application I could think of is testing n files for
    equality, or implementing a poor man's database in multiple files
    with each column of a table in a separate file, which of course would
    be a synchronization nightmare.

    jue
    Jürgen Exner, Apr 27, 2013
    #5
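A minimal sketch of the "testing for equality of n files" idea mentioned above, reading all inputs in lockstep so that only one line per file is in memory at a time. The names `open_string` and `files_equal` are hypothetical (not from the thread), and in-memory filehandles stand in for real files so the example runs as-is.

```perl
use strict;
use warnings;

# Hypothetical helper: open a string as an in-memory file so the sketch
# runs without real files on disk.
sub open_string {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    return $fh;
}

# Hypothetical helper: returns 1 if every handle yields the same sequence
# of lines, 0 otherwise.
sub files_equal {
    my @fh = @_;

    while (1) {
        # scalar() forces readline to fetch a single line per handle
        my ($first, @rest) = map { scalar readline($_) } @fh;

        for my $other (@rest) {
            # one file ended before the other
            return 0 if defined($first) xor defined($other);
            # both still have data but the lines differ
            return 0 if defined($first) && $first ne $other;
        }

        return 1 unless defined $first;    # all files ended together
    }
}

my $same = files_equal(map { open_string($_) } "x\ny\n", "x\ny\n", "x\ny\n");
my $diff = files_equal(open_string("x\ny\n"), open_string("x\n"));
print "same=$same diff=$diff\n";
```

With real files, the handles would come from three-argument `open` calls instead of `open_string`.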
  6. "George Mpouras" writes:
    > # there was a problem with the code at my initial post
    > # Here is corrected, of how to read files like round-robin
    > # using an iterator


    [...]

    > sub Read_files_round_robin
    > {
    > my @FH;
    > for (my $i=$#_; $i>=0; $i--) { if (open my $fh, $_[$i]) {push @FH, $fh} }
    > my $k = $#FH;
    >
    > sub
    > {
    > until (0 == @FH)
    > {
    > for (my $i=$k--; $i>=0; $i--)
    > {
    > $k = $#FH if $k == -1;
    >
    > if ( eof $FH[$i] )
    > {
    > close $FH[$i];
    > splice @FH, $i, 1;
    > $k--
    > }
    > else
    > {
    > return readline $FH[$i]
    > }
    > }
    > }
    >
    > '__ALL_FILES_HAVE_BEEN_READ__'
    > }
    > }


    Fun ways to waste your time:

    ----------------------
    #!/usr/bin/perl
    use strict;

    my $Reader = Read_files_round_robin( 'file1.txt', 'wuzz', 'file2.txt', 'file3.txt');

    while ( my $line = $Reader->() ) {
    chomp $line;
    print "*$line*\n";
    }

    sub Read_files_round_robin
    {
    my (@F, $cur);

    open($F[0][@{$F[0]}], '<', $_) // --$#{$F[0]}
    for @_;

    return sub {
    my ($fh, $l);

    do {
    $fh = shift(@{$F[$cur]}) or return
    } until defined($l = <$fh>);

    push(@{$F[$cur ^ 1]}, $fh);
    $cur ^= 1 unless @{$F[$cur]};

    return $l;
    };
    }
    Rainer Weikusat, Apr 27, 2013
    #6
  7. Rainer Weikusat writes:

    [...]

    > sub Read_files_round_robin
    > {
    > my (@F, $cur);
    >
    > open($F[0][@{$F[0]}], '<', $_) // --$#{$F[0]}
    > for @_;
    >
    > return sub {
    > my ($fh, $l);
    >
    > do {
    > $fh = shift(@{$F[$cur]}) or return
    > } until defined($l = <$fh>);
    >
    > push(@{$F[$cur ^ 1]}, $fh);
    > $cur ^= 1 unless @{$F[$cur]};
    >
    > return $l;
    > };
    > }


    While this is fairly neat, it is unfortunately broken: It is possible
    that the 'current' array runs out of usable file handles but that a
    usable file handle still exists in the 'next' array (e.g., when the
    first file is the one containing the most lines of text). This means
    the 'current' array needs to be switched exactly once in this case
    which, in turn, ends up making the control flow rather ugly :-( (I
    tried a few variants but didn't find one I would want to post).
    Rainer Weikusat, Apr 27, 2013
    #7
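One possible way around the two-array switching problem described above is a single queue: take the handle at the front, read a line, and put the handle back at the end only if it still produced data. This is a sketch of a swapped-in technique, not the code from the posts; `open_string` and `round_robin_queue` are hypothetical names, and in-memory filehandles replace real files so it runs as-is.

```perl
use strict;
use warnings;

# Hypothetical helper: open a string as an in-memory file.
sub open_string {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    return $fh;
}

# Hypothetical helper: takes already-open handles and returns an iterator
# that yields one line per call, cycling through the handles. It returns
# undef once every handle is exhausted.
sub round_robin_queue {
    my @queue = @_;

    return sub {
        while (@queue) {
            my $fh   = shift @queue;
            my $line = readline $fh;
            next unless defined $line;    # exhausted: drop the handle
            push @queue, $fh;             # still live: back of the queue
            return $line;
        }
        return;                           # nothing left to read
    };
}

my $next = round_robin_queue(
    map { open_string($_) } "a1\na2\na3\n", "b1\n", "c1\nc2\n"
);

my @out;
while (defined(my $line = $next->())) {
    chomp $line;
    push @out, $line;
}
print "@out\n";
```

Testing the result with `defined` rather than for truth means a final line consisting of just `0` without a newline does not end the loop early.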
  8. push(@{$F[$cur ^ 1]}, $fh);

    Impressive, I have to study this!
    George Mpouras, Apr 28, 2013
    #8
  9. "George Mpouras" writes:
    > push(@{$F[$cur ^ 1]}, $fh);
    >
    > impressive ,


    Not really. The idea of using two arrays cannot work in this way, as I
    already wrote in another posting. But it is still possible to do away
    with the counting loops (which are IMHO 'rather ugly', IOW, I never
    use for (;;) for anything):

    -----------------
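    sub Read_files_round_robin
    {
    my (@FH, $cur);

    # open each file into the next free slot; on failure, drop the slot again
    open($FH[@FH], '<', $_) // --$#FH
    for @_;

    $cur = -1;

    return sub {
    my $l;

    return unless @FH;

    # advance to the next handle, wrapping around at the end
    $cur = ($cur + 1) % @FH;
    # on EOF, remove the handle; if it was the last slot, step back one
    $cur == @FH and --$cur
    until ($l = readline($FH[$cur])) // (splice(@FH, $cur, 1), !@FH);

    return $l;
    };
    }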
    ------------------

    It is possible to replace the

    $cur == @FH and --$cur

    with

    $cur -= $cur == @FH

    This would be a good idea in C because it would avoid a branch in favor
    of an arithmetic no-op. I don't really know whether this is true
    for Perl, and I'm unsure whether one or the other should be preferred
    for clarity.

    ?
    Rainer Weikusat, Apr 28, 2013
    #9
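A small check of the arithmetic form mentioned above. In Perl a comparison yields 1 when true and a special false value that counts as 0 in numeric context, so `$cur -= $cur == @FH` subtracts either 1 or 0. The array contents below are placeholder strings rather than real filehandles.

```perl
use strict;
use warnings;

my @FH  = ('h0', 'h1', 'h2');   # stand-ins for file handles
my $cur = 3;                    # index just past the last element

$cur -= $cur == @FH;            # comparison is true (1): 3 becomes 2
my $after_true = $cur;

$cur -= $cur == @FH;            # comparison is false (numeric 0): stays 2
my $after_false = $cur;

print "$after_true $after_false\n";
```

This only shows that the two spellings behave the same; it says nothing about which is faster in Perl.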
  10. "Peter J. Holzer" writes:
    > On 2013-04-27 14:49, Jürgen Exner wrote:
    >> "George Mpouras" wrote:
    >>># there was a problem with the code at my initial post
    >>># Here is corrected, of how to read files like round-robin
    >>># using an iterator

    >>
    >> While this might be mildly interesting as an academic exercise I wonder
    >> if there is any actual non-contrived application where you would have to
    >> read multiple files synchronously line-by-line and at the same time the
    >> files are too large to just load them into a variable and then process
    >> their content.

    >
    > Not exactly like George's code, but very similar: Merge sorted files.
    >
    > A similar technique could be used to implement comm(1).


    There's also the paste utility, which does round-robin merging of lines
    from several input files. This would need different EOF handling,
    though (it would need to return an empty line every time a file which
    has run out of data is supposed to be read from).
    Rainer Weikusat, Apr 28, 2013
    #10
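A rough sketch of the paste-style EOF handling described above, under one simplifying assumption: the iterator returns a whole row (one field per input) per call instead of a single line, which makes it easy to stop once every input is exhausted. `open_string` and `paste_rows` are hypothetical names, and this is not a reimplementation of paste(1).

```perl
use strict;
use warnings;

# Hypothetical helper: open a string as an in-memory file.
sub open_string {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    return $fh;
}

# Hypothetical helper: each call reads one line from every handle and
# returns an array reference. Exhausted inputs contribute an empty
# string. Returns undef once no input produced a line.
sub paste_rows {
    my @fh = @_;

    return sub {
        my $live = 0;
        my @row;
        for my $fh (@fh) {
            my $line = readline $fh;
            if (defined $line) { chomp $line; $live++ }
            else               { $line = '' }
            push @row, $line;
        }
        return $live ? \@row : undef;
    };
}

my $next_row = paste_rows(open_string("a1\na2\na3\n"), open_string("b1\n"));

my @rows;
while (my $row = $next_row->()) {
    push @rows, join('|', @$row);
}
print "$_\n" for @rows;
```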
  11. >>>>> "JE" == Jürgen Exner writes:

    JE> "George Mpouras" wrote:
    >> # there was a problem with the code at my initial post
    >> # Here is corrected, of how to read files like round-robin
    >> # using an iterator


    JE> While this might be mildly interesting as an academic exercise I wonder
    JE> if there is any actual non-contrived application where you would have to
    JE> read multiple files synchronously line-by-line and at the same time the
    JE> files are too large to just load them into a variable and then process
    JE> their content.

    not as true today but merge sorting did this very thing in the olden
    days. there are probably some similar problems today.

    uri
    Uri Guttman, May 1, 2013
    #11
  12. Uri Guttman wrote:
    >>>>>> "JE" == Jürgen Exner writes:

    >
    > JE> "George Mpouras" wrote:
    > >> # there was a problem with the code at my initial post
    > >> # Here is corrected, of how to read files like round-robin
    > >> # using an iterator

    >
    > JE> While this might be mildly interesting as an academic exercise I wonder
    > JE> if there is any actual non-contrived application where you would have to
    > JE> read multiple files synchronously line-by-line and at the same time the
    > JE> files are too large to just load them into a variable and then process
    > JE> their content.
    >
    >not as true today but merge sorting did this very thing in the olden
    >days. there are probably some similar problems today.


    As I mentioned in a different message, merge sort does not read
    _synchronously_, i.e. round-robin, from the files. For each
    line/value, it depends on which file currently has the lowest
    line/value, and this can very well be the same file again and again
    for many lines/values.

    jue
    Jürgen Exner, May 1, 2013
    #12
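To illustrate the point above, here is a minimal sketch of merging sorted inputs that also reports which input each line came from. It uses a plain linear scan over the pending lines and string comparison; `open_string` and `merge_sorted` are hypothetical names, and in-memory filehandles stand in for real files.

```perl
use strict;
use warnings;

# Hypothetical helper: open a string as an in-memory file.
sub open_string {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    return $fh;
}

# Hypothetical helper: merges already-sorted inputs. Each call returns
# the smallest pending line and the index of the input it came from,
# or an empty list once every input is exhausted.
sub merge_sorted {
    my @heads;        # one record per input that still has data
    my $index = 0;
    for my $fh (@_) {
        my $line = readline $fh;
        push @heads, { fh => $fh, line => $line, source => $index }
            if defined $line;
        $index++;
    }

    return sub {
        return unless @heads;

        # find the input whose pending line sorts first
        my $min = 0;
        for my $i (1 .. $#heads) {
            $min = $i if $heads[$i]{line} lt $heads[$min]{line};
        }

        my ($line, $source) = @{ $heads[$min] }{qw(line source)};

        # refill from the same input, or retire it at EOF
        my $refill = readline($heads[$min]{fh});
        if (defined $refill) { $heads[$min]{line} = $refill }
        else                 { splice @heads, $min, 1 }

        return ($line, $source);
    };
}

my $next = merge_sorted(open_string("a\nb\ne\n"), open_string("c\nd\n"));

my (@merged, @sources);
while (my ($line, $source) = $next->()) {
    chomp $line;
    push @merged,  $line;
    push @sources, $source;
}
print "@merged\n@sources\n";
```

The source indices show the reads following the data rather than alternating between inputs, which is the difference from round-robin reading.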
  13. On Sat, 27 Apr 2013 07:49:40 -0700 Jürgen Exner wrote:

    JE> While this might be mildly interesting as an academic exercise I wonder
    JE> if there is any actual non-contrived application where you would have to
    JE> read multiple files synchronously line-by-line and at the same time the
    JE> files are too large to just load them into a variable and then process
    JE> their content.

    I've had to do this. I had multiple log files being simultaneously
    processed by log processors and aggregators for real-time monitoring.

    (Each log processor was reading simultaneously from multiple files.)

    The individual files got into the gigabytes and were frequently
    rotated.

    This worked pretty well, keeping up with significant amounts of traffic,
    and never thrashing or ballooning memory usage. Perl 5.12.

    Ted
    Ted Zlatanov, May 6, 2013
    #13
  14. Coroutines -- would they make this task simpler?

    Perl 6 will, I assume, have them; maybe even 5 does?

    David
    David Combs, Jun 24, 2013
    #14
  15. David Combs writes:
    > Coroutines -- would they make this task simpler?


    No. Despite some amount of 'clueless rambling' on Wikipedia on this
    topic (it might have been changed in the meantime; I haven't checked
    it again since some time before the posting you're replying to was
    written), an iterator/generator is not a coroutine but an ordinary
    subroutine with a single point of entry and exit. It's just a
    stateful subroutine. This is essentially the same as 'an object' (in
    the 'OOP' sense) with a single method, and the convenient way to
    provide this will usually be 'a closure' (a subroutine which encloses
    some part of the lexical environment it was created in).

    'Single point of exit' does not refer to stuff like 'multiple return
    statements in a subroutine body' but to the location where the
    control-flow resumes after the subroutine has finished executing which
    is always the statement after the call (Unless the code throws an
    exception. This would be the common example of a subroutine with
    multiple points of exit). In contrast to this, a coroutine could
    yield execution to another, arbitrary coroutine at some 'random' part
    of its 'function body' (multiple points of exit) and execution would
    resume after the most recently performed yield (multiple points of
    entry). This is really the same as 'cooperative [userspace] threading'
    (something another moro^Wvery informed person is apt to re-implement
    RSN whenever that last offender managed to learn why this isn't a good
    idea ...).
    Rainer Weikusat, Jun 24, 2013
    #15
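A small sketch of the equivalence described above between a closure and an object with a single method. Both versions hold the same state and are called in the ordinary way, returning to the caller each time. `make_counter`, `Counter`, and `next_value` are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Closure version: the state lives in the enclosed lexical $n.
sub make_counter {
    my $n = shift;
    return sub { return $n++ };
}

# Object version: the same state in a blessed hash, with one method.
package Counter;

sub new {
    my ($class, $n) = @_;
    return bless { n => $n }, $class;
}

sub next_value {
    my $self = shift;
    return $self->{n}++;
}

package main;

my $closure = make_counter(5);
my $object  = Counter->new(5);

my @from_closure = map { $closure->() } 1 .. 3;
my @from_object  = map { $object->next_value } 1 .. 3;
print "@from_closure | @from_object\n";
```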
