question concerning pipes and large strings

Discussion in 'Perl Misc' started by math math, Jun 19, 2012.

  1. math math

    math math Guest

    Hi,

    I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.

    I would like to sort the file on the first (ID) field and, after this sorting, merge the second fields (i.e. remove the newlines), so that I get a single line in which the second fields of those many hundreds of lines are appended to each other in the order of their alphabetically sorted IDs.

    Is there a way to do that in PERL without reading the whole file into memory?

    Thanks
    math math, Jun 19, 2012
    #1

  2. math math

    math math Guest

    On Tuesday, June 19, 2012 4:50:18 PM UTC+1, math math wrote:
    > Hi,
    >
    > I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.
    >
    > I would like to sort the file on the first (ID) field and after this sorting, merge the second fields (i.e. remove the new lines), so that I get a single line with many hundreds of lines that are in the same order appended to each other as their alphabetically sorted IDs.
    >
    > Is there a way to do that in PERL without reading the whole file into memory?
    >
    > Thanks


    So my first attempt was using a pipe:

    open(my $fhd_pipe, "sort -d -k1,1 $my_input_file |" );
    open(my $fhd_out, ">test.txt" ) or die $!;
    while (<$fhd_pipe>) {
        chomp $sequence;
        my (undef, $sequence) = split("\t", $_);
        print $fhd_out $sequence;
    }
    close $fhd_out;
    close $fhd_pipe;


    But with this approach, I cannot check the return value of the first "open" command, which is quite annoying (what if something goes wrong?)
    math math, Jun 19, 2012
    #2

  3. J. Gleixner

    J. Gleixner Guest

    On 06/19/12 10:50, math math wrote:
    > Hi,
    >
    > I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.
    >
    > I would like to sort the file on the first (ID) field and after this sorting, merge the second fields (i.e. remove the new lines), so that I get a single line with many hundreds of lines that are in the same order appended to each other as their alphabetically sorted IDs.
    >
    > Is there a way to do that in PERL without reading the whole file into memory?
    >
    > Thanks



    What have you tried?

    Probably sorting it first would make it much easier:

    man sort

    These should help with the rest:

    perldoc -f open
    perldoc -f chomp
    J. Gleixner, Jun 19, 2012
    #3
  4. math math <> writes:
    > I have a file with two tab delimited fields. First field is an ID,
    > the second field is a large string (up to hundreds of millions of
    > characters). The file may have many lines.
    >
    > I would like to sort the file on the first (ID) field and after this
    > sorting, merge the second fields (i.e. remove the new lines), so
    > that I get a single line with many hundreds of lines that are in the
    > same order appended to each other as their alphabetically sorted
    > IDs.
    >
    > Is there a way to do that in PERL without reading the whole file into memory?


    Provided it is OK to keep all IDs in memory, something like the code
    below could be used. The basic idea is to parse the complete input
    file once, line by line, and create an array of 'ID records'. Each ID
    record is a reference to a two-element array: the first member is the
    ID itself, the second is the (stream) position where the data part of
    that line starts (sub get_ids). This array is then sorted by
    ID. Afterwards, the code goes through the sorted array, creating the
    output lines by starting a new output line whenever a new ID occurs
    and concatenating the data parts: it seeks to the recorded position
    associated with an ID record, reads 'the remainder of the input line'
    and prints it (sub generate_output).

    NB: For brevity, this omits all error handling. It is assumed that
    invalid input lines don't occur. Replacing the read_id_data call with
    inline code would be an obvious performance improvement. The tab
    character separating the ID from the data part is treated as part of
    the ID.

    NB^2: A probably much faster way to do this would be to go through the
    file line by line, use a hash to record 'seen' IDs, create a per-ID
    output file whenever a new ID is encountered, write the data part of
    each input line to the output file for its ID, add a trailing \n to
    all output files once the input file has been completely processed,
    and merge the output files together in a final processing step.

    ------------------------
    #!/usr/bin/perl
    #

    sub get_ids
    {
        my ($in, $ids) = @_;
        my ($line, $the_id, $pos);

        $pos = tell($in);
        while ($line = <$in>) {
            $line =~ /^([^\t]+\t)/ and $the_id = $1;
            push(@$ids, [$the_id, $pos + length($the_id)]);

            $pos = tell($in);
        }
    }

    sub read_id_data
    {
        my ($fh, $id) = @_;
        my $data;

        seek($fh, $id->[1], 0);
        $data = <$fh>;
        chop($data);

        return $data;
    }

    sub generate_output
    {
        my ($fh, $ids) = @_;
        my ($last_id, $n);

        $last_id = $ids->[0][0];
        print($last_id, read_id_data($fh, $ids->[0]));

        while (++$n < @$ids) {
            if ($ids->[$n][0] ne $last_id) {
                $last_id = $ids->[$n][0];
                print("\n", $last_id);
            }

            print(read_id_data($fh, $ids->[$n]));
        }

        print("\n");
    }

    {
        my ($fh, @ids);

        open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");

        get_ids($fh, \@ids);
        @ids = sort { $a->[0] cmp $b->[0] } @ids;
        generate_output($fh, \@ids);
    }
    Rainer Weikusat, Jun 19, 2012
    #4
  5. Ben Morrow <> writes:
    > Quoth math math <>:
    >> I have a file with two tab delimited fields. First field is an ID, the
    >> second field is a large string (up to hundreds of millions of
    >> characters). The file may have many lines.


    [...]


    >> I would like to sort the file on the first (ID) field and after this
    >> sorting, merge the second fields (i.e. remove the new lines), so that
    >> I get a single line with many hundreds of lines that are in the same
    >> order appended to each other as their alphabetically sorted IDs.


    [...]

    > I would do this in two passes. Start by reading through the file a block
    > at a time, and finding all the ID fields. (I am assuming these are small
    > enough that you aren't worried about keeping the whole list in
    > memory).


    If the file is large and entries with identical ID are not somehow
    clustered, this will result in a lot of (useless) I/O.

    > For each ID, remember the start and finish offset of the second field
    > (using tell, or by keeping a running count of where you are in the file,
    > whichever's easier). Put these in a hash keyed by ID.
    >
    > Then, you can sort the list of IDs, and, for each ID, seek to the right
    > place in the old file and pull out the data item for the new file
    > directly.


    Minus some simplifications, this is a much better idea than building
    an ID array in the way I suggested (all caveats still apply).

    -----------------------
    #!/usr/bin/perl
    #

    sub get_ids
    {
        my ($in, $ids) = @_;
        my ($line, $the_id, $pos);

        $pos = tell($in);
        while ($line = <$in>) {
            $line =~ /^([^\t]+\t)/ and $the_id = $1;
            push(@{$ids->{$the_id}}, $pos + length($the_id));

            $pos = tell($in);
        }
    }

    sub read_data_at
    {
        my ($fh, $pos) = @_;
        my $data;

        seek($fh, $pos, 0);
        $data = <$fh>;
        chop($data);

        return $data;
    }

    {
        my ($fh, %ids);

        open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");

        get_ids($fh, \%ids);

        for (sort(keys(%ids))) {
            print($_);

            print(read_data_at($fh, $_)) for @{$ids{$_}};
            print("\n");
        }
    }
    Rainer Weikusat, Jun 19, 2012
    #5
  6. On 06/19/2012 08:54 AM, math math wrote:
    > On Tuesday, June 19, 2012 4:50:18 PM UTC+1, math math wrote:
    >> Hi,
    >>
    >> I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.
    >>
    >> I would like to sort the file on the first (ID) field and after this sorting, merge the second fields (i.e. remove the new lines), so that I get a single line with many hundreds of lines that are in the same order appended to each other as their alphabetically sorted IDs.
    >>
    >> Is there a way to do that in PERL without reading the whole file into memory?
    >>
    >> Thanks

    >
    > So my first attempt was using a pipe:
    >
    > open(my $fhd_pipe, "sort -d -k1,1 $my_input_file |" );
    > open(my $fhd_out, ">test.txt" ) or die $!;
    > while(<$fhd_pipe>) {
    > chomp $sequence;
    > my (undef, $sequence) = split("\t", $_);
    > print $fhd_out $sequence;
    > }
    > close $fhd_out;
    > close $fhd_pipe;
    >
    >
    > But with this approach, I cannot check the return value of the first "open" command


    You can (and should) but it will only tell you if the shell failed to
    start, not if the sort failed to find the named input file.


    >, which is quite annoying (what if something goes wrong?)


    You can still check the return value of the "close $fhd_pipe", which
    should fail if an error in the sort occurred. Also, either the sort or
    the shell should send a message to stderr. Whether you are likely to
    notice that depends on unspecified details.
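
    A minimal sketch of those two checks, assuming the same sort(1) pipeline
    and placeholder file names as in the earlier snippet:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $my_input_file = 'input.txt';    # placeholder, as in the earlier snippet

    # open() on a pipe only reports a failure to start the child process ...
    open(my $fhd_pipe, '-|', "sort -d -k1,1 $my_input_file")
        or die "cannot start sort: $!";
    open(my $fhd_out, '>', 'test.txt') or die "cannot open test.txt: $!";

    while (my $line = <$fhd_pipe>) {
        chomp $line;
        my (undef, $sequence) = split(/\t/, $line, 2);
        print $fhd_out $sequence;
    }

    close $fhd_out or die "error writing test.txt: $!";

    # ... the exit status of sort only becomes visible when the pipe is closed
    close $fhd_pipe
        or die $! ? "error closing sort pipe: $!"
                  : "sort exited with status " . ($? >> 8);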

    Xho
    Xho Jingleheimerschmidt, Jun 20, 2012
    #6
  7. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Ben Morrow <> writes:
    >> > Quoth math math <>:
    >> >> I have a file with two tab delimited fields. First field is an ID, the
    >> >> second field is a large string (up to hundreds of millions of
    >> >> characters). The file may have many lines.

    >>
    >> [...]
    >>
    >>
    >> >> I would like to sort the file on the first (ID) field and after this
    >> >> sorting, merge the second fields (i.e. remove the new lines), so that
    >> >> I get a single line with many hundreds of lines that are in the same
    >> >> order appended to each other as their alphabetically sorted IDs.

    >>
    >> [...]
    >>
    >> > I would do this in two passes. Start by reading through the file a block
    >> > at a time, and finding all the ID fields. (I am assuming these are small
    >> > enough that you aren't worried about keeping the whole list in
    >> > memory).

    >>
    >> If the file is large and entries with identical ID are not somehow
    >> clustered, this will result in a lot of (useless) I/O.

    >
    > How so?


    Because the complete contents of the file won't fit into the kernel
    page cache and this means whenever a block of data needs to be read
    which presently isn't in the page cache, one of the existing blocks
    needs to be evicted and the data read 'from disk'. This can happen
    numerous times when randomly seeking back and forth within the file.

    The same phenomenon will occur at the perl buffering level except that
    it will likely be much more severe (in terms of system-call overhead)
    because the perl-level buffer will likely (this may be wrong) be much
    smaller than the kernel page cache.

    [...]

    >> Minus some simplifications, this is a much better idea than building
    >> an ID array in the way I suggested (all caveats still apply).
    >>
    >> -----------------------
    >> #!/usr/bin/perl
    >> #
    >>
    >> sub get_ids
    >> {
    >> my ($in, $ids) = @_;
    >> my ($line, $the_id, $pos);
    >>
    >> $pos = tell($in);
    >> while ($line = <$in>) {

    >
    > You're still reading an entire line into memory.


    Can you imagine that I know that? Actually, that just about everyone
    reading your text will know that?

    If you want an opinion on this: If a single line of input is too large
    to be kept in memory, Perl is decidedly the wrong choice for solving
    this problem.


    >> $line =~ /^([^\t]+\t)/ and $the_id = $1;
    >> push(@{$ids->{$the_id}}, $pos + length($the_id));

    >
    > Ah, you are assuming IDs might be duplicated. I was assuming they were
    > unique, and just needed sorting. The OP will have to clarify this.


    The idea I got from the text of the OP was that he wanted to turn
    multiple entries for a given ID into a single line while continuing to
    have multiple entries for different IDs. After rereading his text, I
    think that was probably wrong. A simple implementation of the
    'concatenate everything in ID-sorted order' approach with a sensible
    I/O strategy:

    ----------------
    #!/usr/bin/perl
    #

    use Errno qw(EMFILE ENFILE);

    {
        my ($in, $out, $open, $id, %ids, @open, $input);

        # split the input into one temporary file per line, named after the ID;
        # keep the handles open (in %ids/@open) for as long as the OS allows
        while ($input = <STDIN>) {
            ($id) = $input =~ /^([^\t]+)\t/;
            chop($input);

            until (defined(open($out, '+>', $id))) {
                # out of file descriptors: close (but remember) an older file
                die("open: $id: $!")
                    unless ($! == EMFILE || $! == ENFILE) && @open;

                $ids{pop(@open)} = undef;
            }

            print $out (substr($input, length($id) + 1));

            $ids{$id} = $out;
            push(@open, $id);
            $out = undef;
        }

        # concatenate the temporary files in ID-sorted order, reopening any
        # whose handle had to be closed above
        for (sort(keys(%ids))) {
            $in = $ids{$_};

            if ($in) {
                seek($in, 0, 0);
            } else {
                until (defined(open($in, '<', $_))) {
                    die("open/2: $_: $!")
                        unless $! == EMFILE || $! == ENFILE;

                    1 while $open = pop(@open) and !$ids{$open};
                    $open || die("nothing to close");

                    $ids{$open} = undef;
                }
            }

            print(<$in>);
            $in = $ids{$_} = undef;
            unlink($_);
        }

        print("\n");
    }
    Rainer Weikusat, Jun 20, 2012
    #7
  8. math math

    math math Guest

    On Tuesday, June 19, 2012 6:55:00 PM UTC+1, Ben Morrow wrote:
    > Quoth math math <>:
    > >
    > > I have a file with two tab delimited fields. First field is an ID, the
    > > second field is a large string (up to hundreds of millions of
    > > characters). The file may have many lines.

    >
    > How large is the file altogether? A few hundred megabytes, or
    > substantially larger? It may well not be worth trying to avoid doing
    > this in memory.
    >
    > > I would like to sort the file on the first (ID) field and after this
    > > sorting, merge the second fields (i.e. remove the new lines), so that
    > > I get a single line with many hundreds of lines that are in the same
    > > order appended to each other as their alphabetically sorted IDs.

    >
    > I don't entirely understand what you mean here. If you began with a file
    > that looked like
    >
    > 002 TWO
    > 001 ONE
    > 003 THREE
    >
    > what would you expect to end up with? Perhaps you want
    >
    > ONE TWO THREE


    That's correct.

    >
    > with the IDs removed?
    >
    > > Is there a way to do that in PERL without reading the whole file into
    > > memory?

    >
    > I see other people have recommended sort(1); I would *not* recommend
    > that, in this case. sort(1) will deal with large files by spilling out
    > to temporary files on disk, but there's no need for that here.
    >


    Would sort(1) still create large files if the sort field is only on the ID field (i.e. sort -k1,1)?

    The ID fields are very short, up to 10 chars.

    > I would do this in two passes. Start by reading through the file a block
    > at a time, and finding all the ID fields. (I am assuming these are small
    > enough that you aren't worried about keeping the whole list in memory).
    >
    > For each ID, remember the start and finish offset of the second field
    > (using tell, or by keeping a running count of where you are in the file,
    > whichever's easier). Put these in a hash keyed by ID.
    >
    > Then, you can sort the list of IDs, and, for each ID, seek to the right
    > place in the old file and pull out the data item for the new file
    > directly.
    >
    > Ben
    math math, Jun 20, 2012
    #8
  9. math math

    math math Guest

    On Wednesday, June 20, 2012 2:56:56 PM UTC+1, Rainer Weikusat wrote:
    > Ben Morrow <> writes:
    > > Quoth Rainer Weikusat <>:
    > >> Ben Morrow <> writes:
    > >> > Quoth math math <>:
    > >> >> I have a file with two tab delimited fields. First field is an ID, the
    > >> >> second field is a large string (up to hundreds of millions of
    > >> >> characters). The file may have many lines.
    > >>
    > >> [...]
    > >>
    > >>
    > >> >> I would like to sort the file on the first (ID) field and after this
    > >> >> sorting, merge the second fields (i.e. remove the new lines), so that
    > >> >> I get a single line with many hundreds of lines that are in the same
    > >> >> order appended to each other as their alphabetically sorted IDs.
    > >>
    > >> [...]
    > >>
    > >> > I would do this in two passes. Start by reading through the file a block
    > >> > at a time, and finding all the ID fields. (I am assuming these are small
    > >> > enough that you aren't worried about keeping the whole list in
    > >> > memory).
    > >>
    > >> If the file is large and entries with identical ID are not somehow
    > >> clustered, this will result in a lot of (useless) I/O.

    > >
    > > How so?

    >
    > Because the complete contents of the file won't fit into the kernel
    > page cache and this means whenever a block of data needs to be read
    > which presently isn't in the page cache, one of the existing blocks
    > needs to be evicted and the data read 'from disk'. This can happen
    > numerous times when randomly seeking back and forth within the file.
    >
    > The same phenomenon will occur at the perl buffering level except that
    > it will likely be much more severe (in terms of system-call overhead)
    > because the perl-level buffer will likely (this may be wrong) be much
    > smaller than the kernel page cache.
    >
    > [...]
    >
    > >> Minus some simplifications, this is a much better idea than building
    > >> an ID array in the way I suggested (all caveats still apply).
    > >>
    > >> -----------------------
    > >> #!/usr/bin/perl
    > >> #
    > >>
    > >> sub get_ids
    > >> {
    > >> my ($in, $ids) = @_;
    > >> my ($line, $the_id, $pos);
    > >>
    > >> $pos = tell($in);
    > >> while ($line = <$in>) {

    > >
    > > You're still reading an entire line into memory.

    >
    > Can you imagine that I know that? Actually, that just about everyone
    > reading your text will know that?
    >
    > If you want an opinion on this: If a single line of input is too large
    > to be kept in memory, Perl is decidedly the wrong choice for solving
    > this problem.
    >
    >
    > >> $line =~ /^([^\t]+\t)/ and $the_id = $1;
    > >> push(@{$ids->{$the_id}}, $pos + length($the_id));

    > >
    > > Ah, you are assuming IDs might be duplicated. I was assuming they were
    > > unique, and just needed sorting. The OP will have to clarify this.

    >
    > The idea I got from the text of the OP was that he wanted to turn
    > multiple entries for a given ID into a single line while continuing to
    > have multiple entries for different IDs. After rereading his text, I
    > think that was probably wrong. An simple implementation of the
    > 'concatenate everything in ID-sorted order' with a sensible I/O
    > stragey:
    >


    Hm, I don't understand right away what's really going on below in the script, I will try to decipher it.

    > ----------------
    > #!/usr/bin/perl
    > #
    >
    > use Errno qw(EMFILE ENFILE);
    >
    > {
    > my ($in, $open, $id, %ids, @open, $input);
    >
    > while ($input = <STDIN>) {
    > ($id) = $input =~ /^([^\t]+)\t/;
    > chop($input);
    >
    > until (defined(open($out, '+>', $id))) {
    > die("open: $in: $!")
    > unless ($! == EMFILE || $! == ENFILE) && @open;
    >
    > $ids{pop(@open)} = undef;
    > }
    >
    > print $out (substr($input, length($id) + 1));
    >
    > $ids{$id} = $out;
    > push(@open, $id);
    > $out = undef;
    > }
    >
    > for (sort(keys(%ids))) {
    > $in = $ids{$_};
    >
    > if ($in) {
    > seek($in, 0, 0);
    > } else {
    > until (defined(open($in, '<', $_))) {
    > die("open/2: $in: $!")
    > unless $! == EMFILE || $! == ENFILE;
    >
    > 1 while $open = pop(@open) and !$ids{$open};
    > $open || die("noting to close");
    >
    > $ids{$open} = undef;
    > }
    > }
    >
    > print(<$in>);
    > $in = $ids{$_} = undef;
    > unlink($_);
    > }
    >
    > print("\n");
    > }
    math math, Jun 20, 2012
    #9
  10. math math

    math math Guest

    On Tuesday, June 19, 2012 5:47:28 PM UTC+1, J. Gleixner wrote:
    > On 06/19/12 10:50, math math wrote:
    > > Hi,
    > >
    > > I have a file with two tab delimited fields. First field is an ID, the second field is a large string (up to hundreds of millions of characters). The file may have many lines.
    > >
    > > I would like to sort the file on the first (ID) field and after this sorting, merge the second fields (i.e. remove the new lines), so that I get a single line with many hundreds of lines that are in the same order appended to each other as their alphabetically sorted IDs.
    > >
    > > Is there a way to do that in PERL without reading the whole file into memory?
    > >
    > > Thanks

    >
    >
    > What have you tried?
    >


    Sorry, I should have pasted my trial code first; I did it right after my initial post, but it got here a bit late. Please see my second post for details (there is an ordering mistake in the snippet: "chomp" should be below the line with the "split").

    > Probably sorting it first would make it much easier:
    >
    > man sort
    >

    Indeed, I tried sort first, it works, it is more of a scalability question really.

    > These should help with the rest:
    >
    > perldoc -f open
    > perldoc -f chomp


    Thanks.
    math math, Jun 20, 2012
    #10
  11. math math <> writes:

    [...]

    >> An simple implementation of the
    >> 'concatenate everything in ID-sorted order' with a sensible I/O
    >> stragey:
    >>

    >
    > Hm, I don't understand right away what's really going on below in the script, I will try to decipher it.


    That's what I think to be the best idea here (this may be wrong): it
    splits the input file into files, each containing the data part of a
    single line, while recording the IDs. Afterwards, it sorts the IDs and
    concatenates the temporary files into an output file in that order.
    There's some additional logic to keep as many of the temporary files
    open through the output stage as the OS allows.
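
    In case it helps with the deciphering, a stripped-down sketch of the same
    idea, leaving out the logic that works around the per-process limit on
    open files (as in the posted script, the temporary file names are taken
    directly from the IDs):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;    # IDs encountered so far

    # pass 1: append the data part of every line to a file named after its ID
    while (my $line = <STDIN>) {
        chomp $line;
        my ($id, $data) = split(/\t/, $line, 2);

        open(my $out, '>>', $id) or die "open $id: $!";
        print {$out} $data;
        close($out) or die "close $id: $!";

        $seen{$id} = 1;
    }

    # pass 2: concatenate the temporary files in ID-sorted order, then delete them
    for my $id (sort keys %seen) {
        open(my $in, '<', $id) or die "open $id: $!";
        print while <$in>;    # the data parts contain no newlines
        close($in);
        unlink($id) or warn "unlink $id: $!";
    }
    print "\n";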
    Rainer Weikusat, Jun 20, 2012
    #11
  12. math math <> writes:
    >> On 06/19/12 10:50, math math wrote:
    >> > Hi,
    >> >
    >>> I have a file with two tab delimited fields. First field is an
    >>> ID, the second field is a large string (up to hundreds of
    >>> millions of characters). The file may have many lines.
    >>>
    >>> I would like to sort the file on the first (ID) field and after
    >>> this sorting, merge the second fields (i.e. remove the new lines),
    >>> so that I get a single line with many hundreds of lines that are
    >>> in the same order appended to each other as their alphabetically
    >>> sorted IDs.


    [...]

    >> Probably sorting it first would make it much easier:
    >>
    >> man sort
    >>

    > Indeed, I tried sort first, it works, it is more of a scalability
    > question really.


    This is a really bad idea because sort will reorder the complete input
    lines, including the data part, possibly/probably multiple times for
    each input line, and this means a lot of copying of data which doesn't
    need to be copied, since only the IDs are supposed to be sorted.
    Rainer Weikusat, Jun 20, 2012
    #12
  13. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Ben Morrow <> writes:
    >> > Quoth Rainer Weikusat <>:
    >> >> Ben Morrow <> writes:
    >> >>> I would do this in two passes. Start by reading through the file a block
    >> >>> at a time, and finding all the ID fields. (I am assuming these are small
    >> >>> enough that you aren't worried about keeping the whole list in
    >> >>> memory).
    >> >>
    >> >> If the file is large and entries with identical ID are not somehow
    >> >> clustered, this will result in a lot of (useless) I/O.
    >> >
    >> > How so?

    >>
    >> Because the complete contents of the file won't fit into the kernel
    >> page cache and this means whenever a block of data needs to be read
    >> which presently isn't in the page cache, one of the existing blocks
    >> needs to be evicted and the data read 'from disk'. This can happen
    >> numerous times when randomly seeking back and forth within the file.

    >
    > The part you quoted reads through the entire file exactly once, in
    > order. There is no seeking. I agree there will be a lot of seeking
    > later,


    Could you perhaps enlighten me as to what the point of this remark is
    supposed to be? Since I wrote about seeking, and seeking is a necessary
    part of the complete solution, it should be blatantly obvious that I
    was referring to that.

    [...]

    >> The same phenomenon will occur at the perl buffering level except that
    >> it will likely be much more severe (in terms of system-call overhead)
    >> because the perl-level buffer will likely (this may be wrong) be much
    >> smaller than the kernel page cache.

    >
    > That depends on the filesystem and OS, of course, but yes, in general I
    > would assume this would be the case. I'm not sure it matters: in a
    > process as IO-bound as this, the system call overhead is likely to be
    > irrelevant.


    Let's assume for the sake of example that 'all of the I/O' takes 30
    days. The kilotons of useless system calls take an hour. This is about
    0.14% of the first time span. Would you want to wait an additional
    hour for this task to complete just because it is only 0.14%? Or would
    you rather want to avoid this hour as well? Can you imagine that
    someone who just wants to use the code could want to avoid this hour?
    Or that someone could want to do something more useful with this hour
    of CPU time than 'burn it with useless system calls'?

    >> >> #!/usr/bin/perl
    >> >> #
    >> >>
    >> >> sub get_ids
    >> >> {
    >> >> my ($in, $ids) = @_;
    >> >> my ($line, $the_id, $pos);
    >> >>
    >> >> $pos = tell($in);
    >> >> while ($line = <$in>) {
    >> >
    >> > You're still reading an entire line into memory.

    >>
    >> Can you imagine that I know that? Actually, that just about everyone
    >> reading your text will know that?
    >>
    >> If you want an opinion on this: If a single line of input is too large
    >> to be kept in memory, Perl is decidedly the wrong choice for solving
    >> this problem.

    >
    > I strongly disagree. IMHO a file with lines that long simply isn't
    > meaningfully a 'text file' any more, and so needs to be handled like a
    > binary file: read in blocks, and remember byte positions. Perl is
    > perfectly capable of handling binary data.


    As opposed to something sensible, such as memory-mapping the input
    file (or that part of it which fits into the process address space),
    the overhead is going to be grotesque, and when 'low-level buffer
    control' has to be done in any case, this overhead is not justified.
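
    For illustration only, a rough sketch of what a memory-mapping variant
    could look like, assuming the CPAN module File::Map is available and that
    the whole file can be mapped at once:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Map qw(map_file);    # assumption: File::Map from CPAN is installed

    # map the input file into memory instead of read()ing it into a buffer
    map_file my $map, $ARGV[0];

    my @ids;     # [ ID, start offset of data part, end offset of data part ]
    my $pos = 0;

    while ($pos < length($map)) {
        my $tab = index($map, "\t", $pos);
        last if $tab == -1;
        my $nl = index($map, "\n", $tab + 1);
        $nl = length($map) if $nl == -1;    # last line may lack a newline

        push @ids, [ substr($map, $pos, $tab - $pos), $tab + 1, $nl ];
        $pos = $nl + 1;
    }

    # print the data parts only, in ID-sorted order
    # (each data part is still copied once when it is printed)
    print substr($map, $_->[1], $_->[2] - $_->[1])
        for sort { $a->[0] cmp $b->[0] } @ids;
    print "\n";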

    [...]

    >> ----------------
    >> #!/usr/bin/perl
    >> #
    >>
    >> use Errno qw(EMFILE ENFILE);
    >>
    >> {
    >> my ($in, $open, $id, %ids, @open, $input);
    >>
    >> while ($input = <STDIN>) {
    >> ($id) = $input =~ /^([^\t]+)\t/;
    >> chop($input);
    >>
    >> until (defined(open($out, '+>', $id))) {
    >> die("open: $in: $!")
    >> unless ($! == EMFILE || $! == ENFILE) && @open;

    >
    > Copying the data out to temporary files and then reading it back in
    > again is *bound* to be more IO than seeking in the original file.


    No, it is not bound to be more I/O (neither physical nor 'system call
    I/O') because everything is done strictly sequentially and will thus
    interact nicely with any intermediate buffering layers: there's never
    a need to 'flush the buffer, read a different block into it, flush the
    buffer again, reread the original block'. Whether this helps with the
    issue at hand is a different question: it certainly will if the buffer
    is larger than any individual data item. This might not be the case
    here, and consequently, both approaches should be tested to determine
    which one is better.
    Rainer Weikusat, Jun 20, 2012
    #13
  14. Rainer Weikusat <> writes:

    [...]

    >>> If you want an opinion on this: If a single line of input is too large
    >>> to be kept in memory, Perl is decidedly the wrong choice for solving
    >>> this problem.

    >>
    >> I strongly disagree. IMHO a file with lines that long simply isn't
    >> meaningfully a 'text file' any more, and so needs to be handled like a
    >> binary file: read in blocks, and remember byte positions. Perl is
    >> perfectly capable of handling binary data.

    >
    > As opposed to something sensible such as memory-mapping the input
    > file (or that part of it which fits into the process address space),
    > the overhead is going to be grotesque


    For sake of completeness and because I felt like doing it: Here's a
    seek-based variant based on reading data in 'blocks' (of an arbitrary
    size > 0) which actually works (according to my limited testing, still
    without error handling).

    ----------------------------
    #!/usr/bin/perl
    #

    use constant BLOCK => 4096;

    # read a block of up to BLOCK (or exactly $_[1]) bytes from the handle in $_[0]
    sub read_block
    {
        my ($block, $rc);

        $rc = sysread($_[0], $block, $_[1] // BLOCK);
        $rc // die("sysread: $!");
        $_[1] && $rc != $_[1] && die("short read");

        return $rc ? $block : undef;
    }

    # scan the file block by block, alternating between looking for the tab
    # which ends an ID and the newline which ends the data part, and record
    # [ID, data start offset, data end offset] for each input line
    sub get_ids
    {
        my ($in, $ids) = @_;
        my ($block, $id, $want, $bpos, $sbpos, $fpos);

        $want = "\t";
        while ($block = read_block($in)) {
            $sbpos = $bpos = 0;

            {
                $bpos = index($block, $want, $sbpos);

                if ($want eq "\t") {
                    if ($bpos != -1) {
                        $id .= substr($block, $sbpos, $bpos - $sbpos);
                        push(@$ids, [$id, $fpos + ++$bpos]);
                        $id = '';

                        $want = "\n";
                        $sbpos = $bpos;
                        redo if $sbpos < length($block);
                    } else {
                        $id .= substr($block, $sbpos);
                    }

                    last;
                }

                if ($want eq "\n") {
                    last if $bpos == -1;

                    push(@{$ids->[$#$ids]}, $fpos + $bpos);

                    $want = "\t";
                    $sbpos = $bpos + 1;
                    redo if $sbpos < length($block);
                }
            }

            $fpos += length($block);
        }
    }

    # copy the data part of one ID record to stdout in BLOCK-sized chunks
    sub print_id_data
    {
        my ($fh, $id) = @_;
        my ($blocks, $len);

        seek($fh, $id->[1], 0);
        $len = $id->[2] - $id->[1];

        $blocks = int($len / BLOCK);
        print(read_block($fh, BLOCK)) while ($blocks--);

        $len %= BLOCK;
        print(read_block($fh, $len)) if $len;
    }

    {
        my ($fh, @ids);

        open($fh, '<', $ARGV[0]) // die("open: $ARGV[0]: $!");

        get_ids($fh, \@ids);
        print_id_data($fh, $_) for sort { $a->[0] cmp $b->[0] } @ids;
    }
    Rainer Weikusat, Jun 20, 2012
    #14
  15. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Ben Morrow <> writes:
    >> > Quoth Rainer Weikusat <>:

    >>
    >> >> The same phenomenon will occur at the perl buffering level except that
    >> >> it will likely be much more severe (in terms of system-call overhead)
    >> >> because the perl-level buffer will likely (this may be wrong) be much
    >> >> smaller than the kernel page cache.
    >> >
    >> > That depends on the filesystem and OS, of course, but yes, in general I
    >> > would assume this would be the case. I'm not sure it matters: in a
    >> > process as IO-bound as this, the system call overhead is likely to be
    >> > irrelevant.

    >>
    >> Let's assume for the sake of example that 'all of the I/O' takes 30
    >> days. The kilotons of useless system calls take an hour.

    >
    > 'IO-bound' means that for most of its runtime, the process will be
    > sitting there doing exactly nothing, waiting for IO to complete.


    Really? Who would have guessed that ...
    Rainer Weikusat, Jun 21, 2012
    #15
  16. Rainer Weikusat <> writes:

    [...]

    > seek($fh, $id->[1], 0);


    This should be sysseek (works here because no buffered input is ever
    done on this filehandle).
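
    A minimal illustration of the intended change, reusing $fh and $id from
    the quoted print_id_data sub:

    use Fcntl qw(SEEK_SET);

    # position the underlying file descriptor directly, matching the
    # sysread-based read_block instead of going through PerlIO buffering
    sysseek($fh, $id->[1], SEEK_SET) // die("sysseek: $!");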
    Rainer Weikusat, Jun 21, 2012
    #16
  17. Rainer Weikusat <> writes:

    [...]

    >>> ----------------
    >>> #!/usr/bin/perl
    >>> #
    >>>
    >>> use Errno qw(EMFILE ENFILE);
    >>>
    >>> {
    >>> my ($in, $open, $id, %ids, @open, $input);
    >>>
    >>> while ($input = <STDIN>) {
    >>> ($id) = $input =~ /^([^\t]+)\t/;
    >>> chop($input);
    >>>
    >>> until (defined(open($out, '+>', $id))) {
    >>> die("open: $in: $!")
    >>> unless ($! == EMFILE || $! == ENFILE) && @open;

    >>
    >> Copying the data out to temporary files and then reading it back in
    >> again is *bound* to be more IO than seeking in the original file.


    For a small test I made, this approach is indeed hopeless, but not
    because of the additional data I/O; rather, because of the overhead of
    creating, deleting, opening etc. 'a really large number of files'. But
    my 'data parts' were tiny compared to the 'many millions' of the OP,
    and things might still be different for that case.
    Rainer Weikusat, Jun 21, 2012
    #17
  18. On Wed, 20 Jun 2012 16:29:56 +0100, Rainer Weikusat wrote:

    > math math <> writes:
    >>> On 06/19/12 10:50, math math wrote:
    >>> > Hi,
    >>> >
    >>>> I have a file with two tab delimited fields. First field is an ID,
    >>>> the second field is a large string (up to hundreds of millions of
    >>>> characters). The file may have many lines.
    >>>>
    >>>> I would like to sort the file on the first (ID) field and after this
    >>>> sorting, merge the second fields (i.e. remove the new lines),
    >>>> so that I get a single line with many hundreds of lines that are in
    >>>> the same order appended to each other as their alphabetically sorted
    >>>> IDs.

    >
    > [...]
    >
    >>> Probably sorting it first would make it much easier:
    >>>
    >>> man sort
    >>>

    >> Indeed, I tried sort first, it works, it is more of a scalability
    >> question really.

    >
    > This is a really bad idea because sort will reorder the complete input
    > lines, including the data part, possible/ probably multiple times for
    > each input line, and this means a lot of copying of data which doesn't
    > need to be copied since only the IDs are supposed to be sorted.


    As GNU sort is rather optimized, I would benchmark this before making
    blanket statements like this.

    Also, we don't know if efficiency is relevant. If it runs only once a
    month, at night, the OP probably does not care if it takes a few hours as
    opposed to a few minutes.

    That said, the requirements are rather unusual, so there is also a good
    chance that sort will handle these files abysmally, chewing up all
    memory and disk I/O and effectively bringing the machine to its knees.

    So to the OP: Does sort -k1,1 run in acceptable time for you? If so,
    there is the first part of your answer, the second part is now rather
    trivial (or if you still run out of memory, more trivial than the
    original problem).

    (And if you do run out of memory, ask yourself: do I need a "works
    always" solution, or does just adding more memory solve my immediate
    problem?)

    HTH,
    M4
    Martijn Lievaart, Jun 21, 2012
    #18
  19. Martijn Lievaart <> writes:
    > On Wed, 20 Jun 2012 16:29:56 +0100, Rainer Weikusat wrote:


    [...]

    >>>> man sort
    >>>>
    >>> Indeed, I tried sort first, it works, it is more of a scalability
    >>> question really.

    >>
    >> This is a really bad idea because sort will reorder the complete input
    >> lines, including the data part, possible/ probably multiple times for
    >> each input line, and this means a lot of copying of data which doesn't
    >> need to be copied since only the IDs are supposed to be sorted.

    >
    > As GNU sort is rather optimized, I would benchmark this before making
    > blanket statements like this.


    'Rather optimized' usually means the code is seriously convoluted
    because it used to run faster on some piece of ancient hardware in
    1997 for a single test case because of that. And no matter how
    'optimized', a sort program needs to sort its input, which involves
    reordering it. Completely. In the case of files which are too large
    for the memory of a modern computer, this involves a whole lot of
    copying data around.

    I suggest that you run some benchmarks before making blanket
    statements like the one above.

    > Also, we don't know if efficiency is relevant. If it runs only once a
    > month, at night, the OP probably does not care if it takes a few hours as
    > opposed to a few minutes.


    Efficiency is always relevant except in a single case: the guy who has
    to write the code is so busy with getting it to work at all that the
    mere thought of having to make it work sensibly scares the shit out of
    him, and he tries to pass this competence deficit off as a 'secret
    advantage' when posing for others. Usually, this will also involve a
    dedicated computer for testing, and often the people who are going to
    use the code are not in a position to complain to the person who wrote
    it; IOW, run-time efficiency doesn't matter because it is someone
    else's problem.

    Congratulate yourself on the happy situation you happen to be in. Stop
    assuming that it is 'the universal situation'. Things might look
    rather different if code is written for in-house use and supposed to
    run on a computer which also provides VPN services for customers
    coming from fifty different companies.
    Rainer Weikusat, Jun 21, 2012
    #19
  20. Rainer Weikusat <> writes:
    > Martijn Lievaart <> writes:
    >> On Wed, 20 Jun 2012 16:29:56 +0100, Rainer Weikusat wrote:

    >
    > [...]
    >
    >>>>> man sort
    >>>>>
    >>>> Indeed, I tried sort first, it works, it is more of a scalability
    >>>> question really.
    >>>
    >>> This is a really bad idea because sort will reorder the complete input
    >>> lines, including the data part, possible/ probably multiple times for
    >>> each input line, and this means a lot of copying of data which doesn't
    >>> need to be copied since only the IDs are supposed to be sorted.

    >>
    >> As GNU sort is rather optimized, I would benchmark this before making
    >> blanket statements like this.

    >
    > 'Rather optimmized' usually means the code is seriously convoluted
    > because it used to run faster on some piece of ancient hardware in
    > 1997 for a single test case because of that. And not matter how
    > 'optimized', a sort program needs to sort its input. Which involves
    > reordering it. Completely. In case of files which are too large for
    > the memory of a modern computer, this involves a real lot of copying
    > data around.
    >
    > I suggest that you make some benchmarks before making blanket
    > statements like the one above.


    On some random computer I just used for that, sorting a 1080M file
    (4000000 lines) with sort using the first column as key and sending
    output to /dev/null (average from three runs) comes out at 40.9s
    wallclock/ 6.4 user/ 3.2sys. Using one of the Perl scripts I posted
    (with unused code removed) to extract the ID from each input line and
    sort the list of IDs takes 9.5w/ 7.8u/ 0.7s. I didn't check whether
    sort used temporary files for this, but it doesn't really matter
    because sort is guaranteed to lose out once the file is large enough. In
    this case, the volume of data it needed to deal with was 1,132,259,700
    bytes while the Perl script only needed to sort 28,000,000 bytes. And
    the average line length was only 283 bytes in this case, not the 'many
    millions' of the original problem.
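
    For reference, a rough sketch of what the timed ID-extraction step
    amounts to (reading from standard input; the offset bookkeeping of the
    posted scripts is left out because only the sorting cost was measured):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # collect only the ID field of every line, then sort that (small) list
    my @ids;
    while (my $line = <STDIN>) {
        push @ids, $1 if $line =~ /^([^\t]+)\t/;
    }
    @ids = sort @ids;

    printf STDERR "sorted %d IDs\n", scalar @ids;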
    Rainer Weikusat, Jun 21, 2012
    #20
