best way to make a few changes in a large data file

Discussion in 'Perl Misc' started by ccc31807, Jan 8, 2013.

  1. ccc31807

    ccc31807 Guest

    My big data file looks like this:
    1,al
    2,becky
    3,carl
    4,debbie
    5,ed
    6,frieda
    .... for perhaps 200K or 300k lines

    My change file looks like this:
    5, edward
    .... for perhaps ten or twelve lines

    My script looks like this (SKIPPING THE DETAILS):
    my %big_data_hash;
    while (<BIG>) { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
    while (<CHANGE>) { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
    foreach my $id (keys %big_data_hash)
    { print OUT qq($id,$big_data_hash{$id}\n); }

    This seems wasteful to me, loading several hundred thousand lines of data in memory just to make a few changes. Is there any way to tie the data file to a hash and make the changes directly?

    Does anyone have any better ideas?

    Thanks, CC.
    ccc31807, Jan 8, 2013
    #1

  2. ccc31807 <> writes:
    > My big data file looks like this:
    > 1,al
    > 2,becky
    > 3,carl
    > 4,debbie
    > 5,ed
    > 6,frieda
    > ... for perhaps 200K or 300k lines
    >
    > My change file looks like this:
    > 5, edward
    > ... for perhaps ten or twelve lines
    >
    > My script looks like this (SKIPPING THE DETAILS):
    > my %big_data_hash;
    > while (<BIG>) { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
    > while (<CHANGE>) { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
    > foreach my $id (keys %big_data_hash)
    > { print OUT qq($id,$big_data_hash{$id}\n); }
    >
    > This seems wasteful to me, loading several hundred thousand lines of
    > data in memory just to make a few changes. Is there any way to tie
    > the data file to a hash and make the changes directly?


    For a text file, no, since insertion or removal of characters affects
    the relative positions of all characters after the place where the
    change occurred. But your algorithm could be improved: instead of
    reading the data file and the changes file into memory completely,
    changing the 'data hash' and looping over all keys of that to generate
    the modified output, you could read the change file (which is
    presumably much smaller) into memory and then process the data file
    line by line, applying changes 'on the go' where necessary, i.e.,
    (uncompiled)

    my %change_hash;
    my ($id, $name);    # don't create new variables for every iteration

    while (<CHANGE>) {
        chomp;
        ($id, $name) = split /,/;
        $change_hash{$id} = $name;
    }

    while (<BIG>) {
        chomp;
        ($id, $name) = split /,/;
        $name = $change_hash{$id} if exists($change_hash{$id});

        print OUT qq($id,$name\n);
    }
    Rainer Weikusat, Jan 8, 2013
    #2

  3. bugbear <bugbear@trim_papermule.co.uk_trim> writes:
    > ccc31807 wrote:
    >> My big data file looks like this:
    >> 1,al
    >> 2,becky
    >> 3,carl
    >> 4,debbie
    >> 5,ed
    >> 6,frieda
    >> ... for perhaps 200K or 300k lines
    >>
    >> My change file looks like this:
    >> 5, edward
    >> ... for perhaps ten or twelve lines


    [...]

    > Any improvement would need a change in the file structure - the big win would come from NOT having to modify
    > a significant number of the disc blocks that represent the file.
    >
    > This would involve techniques such as indexing, trees of blocks, fixed size padding of the data, or having "pad" areas
    > to avoid always having to shuffle the data up and down on edits.
    >
    > I believe Oracle make a piece of software that does


    s/makes/bought/. It is called BerkeleyDB. Any of the other freely
    available 'hashed database' packages (e.g., GDBM) should be usable as
    well.
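
    To make that concrete, a minimal DB_File sketch (uncompiled; the file
    names here are made up, and the big file would have to be imported
    into the DB file once up front so that later runs only touch the
    handful of changed records):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # tie the hash to an on-disk hashed database (created on first use)
    tie my %names, 'DB_File', 'names.db', O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot tie names.db: $!";

    # apply the small change file directly; only the affected records are rewritten
    open my $chg, '<', 'changes.txt' or die $!;
    while (<$chg>) {
        chomp;
        my ($id, $name) = split /,/, $_, 2;
        $names{$id} = $name;
    }
    untie %names;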
    Rainer Weikusat, Jan 8, 2013
    #3
  4. Hans Mulder

    Hans Mulder Guest

    On 8/01/13 17:43:46, Rainer Weikusat wrote:
    > bugbear <bugbear@trim_papermule.co.uk_trim> writes:
    >> ccc31807 wrote:
    >>> My big data file looks like this:
    >>> 1,al
    >>> 2,becky
    >>> 3,carl
    >>> 4,debbie
    >>> 5,ed
    >>> 6,frieda
    >>> ... for perhaps 200K or 300k lines
    >>>
    >>> My change file looks like this:
    >>> 5, edward
    >>> ... for perhaps ten or twelve lines

    >
    > [...]
    >
    >> Any improvement would need a change in the file structure - the big win would come from NOT having to modify
    >> a significant number of the disc blocks that represent the file.
    >>
    >> This would involve techniques such as indexing, trees of blocks, fixed size padding of the data, or having "pad" areas
    >> to avoid always having to shuffle the data up and down on edits.
    >>
    >> I believe Oracle make a piece of software that does

    >
    > s/makes/bought/. It is called BerkeleyDB. Any of the other freely
    > available 'hashed database' packages (eg, GDBM) should be usuable as
    > well.


    Oracle have also bought a product called MySQL, which may or may not be
    what bugbear was thinking of.

    -- HansM
    Hans Mulder, Jan 8, 2013
    #4
  5. ccc31807

    ccc31807 Guest

    On Tuesday, January 8, 2013 11:38:02 AM UTC-5, Rainer Weikusat wrote:
    > change occurred. But your algorithm could be improved: instead of
    > reading the data file and the changes file into memory completely,
    > changing the 'data hash' and looping over all keys of that to generate
    > the modified output, you could read the change file (which is
    > presumably much smaller) into memory and then process the data file
    > line by line, applying changes 'on the go' where necessary, ie,


    You would think so, anyway. This was the first thing I tried, and it turns out (on my setup at least) that printing the outfile line by line takes a lot longer than dumping the whole thing into memory and then printing the data structure once.

    I also thought of using the ID as an index to an array and tying the disk file to an array, but to be honest I was just too lazy to try it. The array would be very sparse (several 100K rows out of a potential 10M array, since IDs can go as high as 99999999) and it seemed more wasteful than using a hash with only the number of keys that I actually have.

    It's not a big deal, it wouldn't matter if it took 5 seconds to run or 5 minutes to run, as long as it produces the correct results.

    CC.

    > (uncompiled)
    >
    > my %change_hash;
    > my ($id, $name);    # don't create new variables for every iteration
    >
    > while (<CHANGE>) {
    >     chomp;
    >     ($id, $name) = split /,/;
    >     $change_hash{$id} = $name;
    > }
    >
    > while (<BIG>) {
    >     chomp;
    >     ($id, $name) = split /,/;
    >     $name = $change_hash{$id} if exists($change_hash{$id});
    >
    >     print OUT qq($id,$name\n);
    > }
    ccc31807, Jan 8, 2013
    #5
  6. ccc31807 <> writes:
    > On Tuesday, January 8, 2013 11:38:02 AM UTC-5, Rainer Weikusat wrote:
    >> change occurred. But your algorithm could be improved: instead of
    >> reading the data file and the changes file into memory completely,
    >> changing the 'data hash' and looping over all keys of that to generate
    >> the modified output, you could read the change file (which is
    >> presumably much smaller) into memory and then process the data file
    >> line by line, applying changes 'on the go' where necessary, ie,

    >
    > You would think so, anyway. This was the first thing I tried, and it
    > turns out (on my setup at least) that printing the outfile line by
    > line takes a lot longer than dumping the whole thing into memory
    > then printing the DS once.


    You are both reading and printing the output file 'line by line', at
    least insofar as the (pseudo-)code you posted accurately represents the
    code you are actually using. Consequently, your statement above
    doesn't make sense, except insofar as it communicates that you tried
    something which didn't work as you expected it to and that you (judging
    from the text above) don't really have an idea why it didn't do that.

    As can be determined by some experiments, constructing a big hash and
    doing a few lookups on that is less expensive than constructing a
    small hash and doing a lot of lookups on that.

    Data file (d0) was created with

    perl -e 'for ("a" .. "z", "A" .. "Z") { print $n++ . ",$_\n"; }'

    and by concatenating the output of that with itself multiple times (468
    lines in total); the 'changes file' (d1) was

    --------
    17,X
    41,y
    22,W
    --------

    Provided the (relatively few) replacements all occur 'early on' in the
    data file, the basic idea of using a hash of changes is indeed faster
    than using a data hash. Otherwise, things aren't that simple.

    ----------------
    use Benchmark;

    open($out, '>', '/dev/null');

    timethese(-5, {
        ccc => sub {
            my ($fh, %h, $id, $d);

            open($fh, '<', 'd0');
            while (<$fh>) {
                ($id, $d) = split /,/;
                $h{$id} = $d;
            }
            $fh = undef;

            open($fh, '<', 'd1');
            while (<$fh>) {
                ($id, $d) = split /,/;
                $h{$id} = $d;
            }

            for (keys(%h)) {
                print $out ($_, ',', $h{$_});
            }
        },

        sane => sub {
            my ($fh, %h, $id, $d, $v);

            open($fh, '<', 'd1');
            while (<$fh>) {
                ($id, $d) = split /,/;
                $h{$id} = $d;
            }
            $fh = undef;

            open($fh, '<', 'd0');
            while (<$fh>) {
                ($id, $d) = split /,/;

                $v = $h{$id};
                print $out ($id, ',', $d), next unless defined($v);

                print $out ($id, ',', $v);
                delete($h{$id});    # delete (not undef) so the hash can become empty
                last unless %h;     # stop checking once all changes are applied
            }

            print $out ($_) while <$fh>;
        }});
    Rainer Weikusat, Jan 8, 2013
    #6
  7. [...]

    > As can be determined by some experiments, constructing a big hash and
    > doing a few lookups on that is less expensive than constructing a
    > small hash and doing a lot of lookups on that.


    As a quick addition to that: this is also too simplistic, because the
    original code does exactly as many hash lookups, except that most of
    them are successful. Judging from a few more tests, using 'small
    integers' as hash keys doesn't seem to be something the perl hashing
    algorithm likes very much, e.g.,

    $h{17} = 'X';
    $h{22} = 'Y';

    will put (for 5.10.1) both data items in the same slot.
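
    A quick way to see this for yourself (this relies on the old, pre-5.26
    behaviour of scalar(%h) reporting used/total buckets; on newer perls it
    just returns the key count):

    my %h;
    $h{17} = 'X';
    $h{22} = 'Y';
    print scalar(%h), "\n";    # e.g. "1/8" means both keys share one bucket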
    Rainer Weikusat, Jan 8, 2013
    #7
  8. Ben Morrow <> writes:
    > Quoth ccc31807 <>:
    >> My big data file looks like this:
    >> 1,al
    >> 2,becky
    >> 3,carl
    >> 4,debbie
    >> 5,ed
    >> 6,frieda
    >> ... for perhaps 200K or 300k lines
    >>
    >> My change file looks like this:
    >> 5, edward
    >> ... for perhaps ten or twelve lines
    >>
    >> My script looks like this (SKIPPING THE DETAILS):
    >> my %big_data_hash;
    >> while (<BIG>) { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
    >> while (<CHANGE>) { chomp; my ($id, $name) = split /,/; $big_data_hash{$id} = $name; }
    >> foreach my $id (keys %big_data_hash)
    >> { print OUT qq($id,$big_data_hash{$id}\n); }
    >>
    >> This seems wasteful to me, loading several hundred thousand lines of
    >> data in memory just to make a few changes. Is there any way to tie the
    >> data file to a hash and make the changes directly?

    >
    > If the numbers in the file are consecutive and contiguous, you could use
    > Tie::File. Otherwise you would be better off using some sort of database
    > in place of the large file.


    That's going to suffer from the same problem as the 'put the changes
    into a hash' idea I posted earlier: searching for a particular key a
    large number of times in a small hash, with most of these searches
    being unsuccessful, is going to be slower than building a large hash
    and (successfully) searching for a small number of keys in that. And
    since there's no way to determine the key of a particular 'big file'
    line except by reading this line (which implies reading everything up to
    this line) and parsing it, and no way to generate the output stream
    except by writing out all 'new' lines in the order they are supposed
    to appear, it won't be possible to save any I/O in this way.

    There are a number of possibilities here, but without knowing more
    about the problem, it is not really possible to make a sensible
    suggestion (e.g., what is supposed to be saved, memory or execution time?
    Is it possible to change the process generating the 'big files'? If
    not, how often is a file created and how often processed?).
    Rainer Weikusat, Jan 8, 2013
    #8
  9. C.DeRykus

    C.DeRykus Guest

    On Tuesday, January 8, 2013 10:51:11 AM UTC-8, ccc31807 wrote:
    > On Tuesday, January 8, 2013 11:38:02 AM UTC-5, Rainer Weikusat wrote:
    > > change occurred. But your algorithm could be improved: instead of
    > > reading the data file and the changes file into memory completely,
    > > changing the 'data hash' and looping over all keys of that to generate
    > > the modified output, you could read the change file (which is
    > > presumably much smaller) into memory and then process the data file
    > > line by line, applying changes 'on the go' where necessary, i.e.,
    >
    > You would think so, anyway. This was the first thing I tried, and it turns out (on my setup at least) that printing the outfile line by line takes a lot longer than dumping the whole thing into memory and then printing the data structure once.
    >
    > I also thought of using the ID as an index to an array and tying the disk file to an array, but to be honest I was just too lazy to try it. The array would be very sparse (several 100K rows out of a potential 10M array, since IDs can go as high as 99999999) and it seemed more wasteful than using a hash with only the number of keys that I actually have.
    >
    > ...
    > It's not a big deal, it wouldn't matter if it took 5 seconds to run or 5 minutes to run, as long as it produces the correct results.
    > ...

    Since speed isn't critical, the Tie::File suggestion
    would simplify the code considerably. Since the whole
    file isn't loaded, big files won't be problematic and
    any changes to the tied array will update the file
    at once. However, IDs will have to map to actual file line
    numbers, and Tie::File will automatically create empty
    lines in the file if the array is sparse.

    e.g.:

    use Tie::File;

    tie my @array, 'Tie::File', 'd0' or die $!;

    open(my $fh, '<', 'd1') or die $!;
    while (<$fh>) {
        chomp;
        my ($id, $value) = split /,/;
        $array[$id-1] = "$id,$value";
    }

    --
    Charles DeRykus
    C.DeRykus, Jan 9, 2013
    #9
  10. "C.DeRykus" <> writes:

    [...]

    > Since speed isn't critical, the Tie::File suggestion
    > would simplify the code considerably.


    [...]

    > use Tie::File;
    >
    > tie my @array, 'Tie::File', 'd0' or die $!;
    >
    > open(my $fh, '<', 'd1') or die $!;
    > while (<$fh>) {
    > chomp;
    > my($id, $value) = split /,/;
    > $array[$id-1] = "$id,$value";
    > }


    Including 1361 lines of code stored in another file does not
    'simplify the code'. It makes it a hell of a lot more complicated. Assuming
    that speed doesn't matter, a simple implementation could look like
    this

    sub small
    {
        my ($fh, %chgs);

        open($fh, '<', 'd1');
        %chgs = map { split /,/ } <$fh>;

        open($fh, '<', 'd0');
        /(.*),(.*)/s, print($1, ',', $chgs{$1} // $2) while <$fh>;
    }

    It can be argued that using Tie::File, provided the semantics match,
    makes the task of the programmer easier but not the code, and even this
    isn't necessarily true --- Tie::File surely makes it easier to shoot
    oneself in the foot here, see the 'Deferred Writing' section in the
    manpage --- because reading through more than six pages of technical
    documentation then also becomes part of the problem. But this shifts
    the issue to a different plane: the problem is no longer a technical
    one but a personal one -- how can $person with $skills get this done
    with as little (intellectual) effort as possible. And while this may
    well be a valid concern (e.g., if someone has to solve the problem
    quickly for his own use) it doesn't translate to a universal
    recommendation.
    Rainer Weikusat, Jan 9, 2013
    #10
  11. Ted Zlatanov

    Ted Zlatanov Guest

    On Tue, 8 Jan 2013 10:51:11 -0800 (PST) ccc31807 <> wrote:

    c> You would think so, anyway. This was the first thing I tried, and it
    c> turns out (on my setup at least) that printing the outfile line by
    c> line takes a lot longer than dumping the whole thing into memory then
    c> printing the DS once.

    I have never experienced this. Could you, for instance, be reopening
    the change file repeatedly? Any chance you could post that slow version
    of the code?

    I would recommend, if you are stuck on the text-based data files, to
    use perl -p -e 'BEGIN { # load your change file } ... process ...'

    This doesn't have to be a one-liner, but it's a good way to quickly test
    the "slow performance" issue, e.g.

    perl -p -e 's/^5,.*/5,edward/' myfile > myrewrite

    If that's slow, something's up.
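
    Spelled out a bit more (just a sketch of that -p filter; the file names
    and the %change hash are made up for illustration):

    # filter.pl -- run as: perl -p filter.pl bigfile > newfile
    BEGIN {
        open my $chg, '<', 'changes.txt' or die $!;
        while (<$chg>) {
            chomp;
            my ($id, $name) = split /,/, $_, 2;
            $change{$id} = $name;
        }
    }
    # -p wraps the rest in a while (<>) { ... } continue { print } loop
    if (/^(\d+),/ and exists $change{$1}) {
        $_ = "$1,$change{$1}\n";
    }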

    Ted
    Ted Zlatanov, Jan 9, 2013
    #11
  12. Rainer Weikusat <> writes:
    > "C.DeRykus" <> writes:
    >
    > [...]
    >
    >> Since speed isn't critical, the Tie::File suggestion
    >> would simplify the code considerably.

    >
    > [...]
    >
    >> use Tie::File;
    >>
    >> tie my @array, 'Tie::File', 'd0' or die $!;
    >>
    >> open(my $fh, '<', 'd1') or die $!;
    >> while (<$fh>) {
    >> chomp;
    >> my($id, $value) = split /,/;
    >> $array[$id-1] = "$id,$value";
    >> }


    [...]

    > Assuming that speed doesn't matter, a simple implementation could
    > look like this
    >
    > sub small
    > {
    > my ($fh, %chgs);
    >
    > open($fh, '<', 'd1');
    > %chgs = map { split /,/ } <$fh>;
    >
    > open($fh, '<', 'd0');
    > /(.*),(.*)/s, print ($1, ',', $chgs{$1} // $2) while <$fh>;
    > }


    As an afterthought: instead of guessing at what's taking the time when
    executing the code above, I've tested it. The 'small_hash'
    implementation below (with data files constructed in the way I
    described in an earlier posting) is either faster than big_hash or
    runs at comparable speeds (tested with files up to 1004K in size). It
    can also process a 251M file, which the big_hash one can't do within a
    reasonable amount of time because it first causes perl to eat all the RAM
    available on the system where I tested this and then drives that system into
    'heavy thrashing' mode, because 'all available RAM' is - by far - not
    enough.

    ----------------
    use Benchmark;

    open($out, '>', '/dev/null');

    timethese(-5, {
        big_hash => sub {
            my ($fh, %data, $k, $d);

            open($fh, '<', 'd0');
            %data = map { split /,/ } <$fh>;

            open($fh, '<', 'd1');
            while (<$fh>) {
                ($k, $d) = split /,/;
                $data{$k} = $d;
            }

            print $out ($_, ',', $data{$_}) for keys(%data);
        },

        small_hash => sub {
            my ($fh, %chgs, $k, $d);

            open($fh, '<', 'd1');
            %chgs = map { split /,/ } <$fh>;

            open($fh, '<', 'd0');
            while (<$fh>) {
                ($k, $d) = split /,/;
                print $out ($k, ',', $chgs{$k} // $d);
            }
        }});
    Rainer Weikusat, Jan 11, 2013
    #12
  13. BobMCT

    BobMCT Guest

    On Fri, 11 Jan 2013 13:56:32 +0000, Rainer Weikusat
    <> wrote:

    >Rainer Weikusat <> writes:
    >> "C.DeRykus" <> writes:

    Just a thought, but did you ever consider loading the data into a
    temporary indexed database table and 'batch' updating it using the
    indexing keys? Then you could dump the table to a flat file when
    done. You should be able to use shell commands to load the data, run the
    script, then dump the table to a file.
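
    A minimal sketch of that idea, assuming DBD::SQLite is available (all
    file, table and column names here are made up for illustration):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=tmp_names.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE IF NOT EXISTS names (id INTEGER PRIMARY KEY, name TEXT)');

    # load the big file once (INSERT OR REPLACE keeps reruns idempotent)
    my $ins = $dbh->prepare('INSERT OR REPLACE INTO names (id, name) VALUES (?, ?)');
    open my $big, '<', 'big.csv' or die $!;
    while (<$big>) { chomp; $ins->execute(split /,/, $_, 2); }

    # batch-apply the handful of changes via the primary-key index
    my $upd = $dbh->prepare('UPDATE names SET name = ? WHERE id = ?');
    open my $chg, '<', 'changes.csv' or die $!;
    while (<$chg>) { chomp; my ($id, $name) = split /,/, $_, 2; $upd->execute($name, $id); }
    $dbh->commit;

    # dump the table back out to a flat file
    open my $out, '>', 'big.new.csv' or die $!;
    my $sth = $dbh->prepare('SELECT id, name FROM names ORDER BY id');
    $sth->execute;
    while (my ($id, $name) = $sth->fetchrow_array) { print $out "$id,$name\n"; }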

    Just my $.02 worth
    BobMCT, Jan 12, 2013
    #13
  14. BobMCT <> writes:
    > On Fri, 11 Jan 2013 13:56:32 +0000, Rainer Weikusat
    > <> wrote:
    >
    >>Rainer Weikusat <> writes:
    >>> "C.DeRykus" <> writes:

    > Just a thought, but did you ever consider loading the data into a
    > temporary indexed database table and 'batch' updating it using the
    > indexing keys?


    As I wrote in a reply to an earlier posting: This would be a
    perfect job for one of the available 'flat file' database packages,
    eg, DB_File. But unless the same 'base data' file is processed more
    than once, this means 'read the big file', 'write a big file', 'read
    this big file', 'write another big file' and the replacement step
    would turn into 'modify the big file'. I doubt that this would be
    worth the effort.
    Rainer Weikusat, Jan 12, 2013
    #14
  15. On 01/09/2013 06:10 AM, C.DeRykus wrote:
    >
    > Since speed isn't critical, the Tie::File suggestion would simplify
    > the code considerably. Since the whole file isn't loaded, big files
    > won't be problematic


    I haven't used it in a while, but if I recall correctly Tie::File stores
    the entire table of line-number/byte-offset in RAM, and that can often
    be about as large as storing the entire file if the lines are fairly short.

    Xho
    Xho Jingleheimerschmidt, Jan 15, 2013
    #15
  16. C.DeRykus

    C.DeRykus Guest

    On Monday, January 14, 2013 7:24:45 PM UTC-8, Xho Jingleheimerschmidt wrote:
    > On 01/09/2013 06:10 AM, C.DeRykus wrote:
    > >
    > > Since speed isn't critical, the Tie::File suggestion would simplify
    > > the code considerably. Since the whole file isn't loaded, big files
    > > won't be problematic
    >
    > I haven't used it in a while, but if I recall correctly Tie::File stores
    > the entire table of line-number/byte-offset in RAM, and that can often
    > be about as large as storing the entire file if the lines are fairly short.

    Actually, IIUC, Tie::File is more parsimonious of memory than even DB_File, for instance, and employs a
    "lazy cache" whose size can be user-specified.

    See: http://perl.plover.com/TieFile/why-not-DB_File

    So, even with an overhead of 310 bytes per record, that
    would get slow only if the file gets really huge and
    least-recently read records start to get tossed.
    But the stated aim was accuracy rather than speed.

    And, since there's a 10M-record upper bound with only 200-300K actual records, that's unlikely to reach show-stopper status. It took only a couple of seconds to read a comparably sized file in my simple test.
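
    The cache limit can be passed at tie time, e.g. (just a sketch; the
    20 MB figure is arbitrary):

    use Tie::File;

    # cap Tie::File's read cache at roughly 20 MB
    tie my @lines, 'Tie::File', 'd0', memory => 20_000_000
        or die "Cannot tie d0: $!";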

    --
    Charles DeRykus
    C.DeRykus, Jan 15, 2013
    #16
  17. "C.DeRykus" <> writes:

    [...]

    > Tie::File is more parsimonious of memory than even DB_File for instance and employs a
    > "lazy cache" whose size can be user-specified.
    >
    > See: http://perl.plover.com/TieFile/why-not-DB_File
    >
    > So, even with overhead of 310 bytes per record, that
    > would get slow only if the file gets really huge and
    > least-recently read records start to get tossed.
    > But the stated aim was accuracy rather than speed.


    Nevertheless, Tie::File not only needs *much* more memory than a
    line-by-line processing loop (~5000 bytes vs 138M for a 63M file) but
    is also atrociously slow: replacing 10 randomly selected lines in a
    53,248-line file with a total size of 251K needs (on the system
    where I tested this) about 0.02s when reading but about 0.51s when
    using Tie::File (and it is probably still completely unsuitable for
    solving the original problem to begin with).
    Rainer Weikusat, Jan 15, 2013
    #17
  18. C.DeRykus

    C.DeRykus Guest

    On Tuesday, January 15, 2013 12:40:32 PM UTC-8, Rainer Weikusat wrote:
    > "C.DeRykus" <> writes:
    >
    > [...]
    >
    > > Tie::File is more parsimonious of memory than even DB_File, for instance, and employs a
    > > "lazy cache" whose size can be user-specified.
    > >
    > > See: http://perl.plover.com/TieFile/why-not-DB_File
    > >
    > > So, even with an overhead of 310 bytes per record, that
    > > would get slow only if the file gets really huge and
    > > least-recently read records start to get tossed.
    > > But the stated aim was accuracy rather than speed.
    >
    > Nevertheless, Tie::File not only needs *much* more memory than a
    > line-by-line processing loop (~5000 bytes vs 138M for a 63M file) but
    > is also atrociously slow: replacing 10 randomly selected lines in a
    > 53,248-line file with a total size of 251K needs (on the system
    > where I tested this) about 0.02s when reading but about 0.51s when
    > using Tie::File (and it is probably still completely unsuitable for
    > solving the original problem to begin with).

    In general I'd agree. But there's an upper bound of 10M records. If that scenario changed or some threshold was impacted, you could re-design. But, who cares here if you lose a second of runtime... or memory bumps during that short window. The OP said accuracy - not speed - was the objective: "it wouldn't matter if it took 5 seconds to run or 5 minutes to run, as long as it produces the correct results."

    The code becomes simpler, more intuitive, timelier. You can quickly move on... to more pressing/interesting/challenging issues.

    --
    Charles DeRykus
    C.DeRykus, Jan 15, 2013
    #18
  19. "C.DeRykus" <> writes:

    [...]

    >> Nevertheless, Tie::File not only needs *much* more memory than a
    >> line-by-line processing loop (~5000 bytes vs 138M for a 63M file) but
    >> is also atrociously slow: replacing 10 randomly selected lines in a
    >> 53,248-line file with a total size of 251K needs (on the system
    >> where I tested this) about 0.02s when reading but about 0.51s when
    >> using Tie::File (and it is probably still completely unsuitable for
    >> solving the original problem to begin with).

    >
    > In general I'd agree. But there's an upper bound of 10M records. If
    > that scenario changed or some threshold was impacted, you could
    > re-design. But, who cares here if you lose a second of runtime... or
    > memory bumps during that short window. The OP said accuracy - not
    > speed - was the objective: "it wouldn't matter if it took 5 seconds
    > to run or 5 minutes to run, as long as it produces the correct
    > results."
    >
    > The code becomes simpler, more intuitive, timelier. You can quickly
    > move on... to more pressing/interesting/challenging issues.


    The code does not 'become simpler', it becomes a lot more complicated.
    Not even the 'front-end code' which needs to be written specifically for
    this is shorter than a sensible (meaning, it performs well)
    implementation without Tie::File, since it was 8 lines of code in both
    cases. A 'performance doesn't matter' implementation can be shorter
    than that, as demonstrated. IMO, this is really an example of using a
    module because it exists, even though it isn't suitable for solving the
    described problem, is a lot more complicated than just using the
    facilities already provided by perl and is vastly technically inferior
    to these as well.
    Rainer Weikusat, Jan 15, 2013
    #19
