Handling Large Files (a few GB) in Perl

Discussion in 'Perl Misc' started by sydches@gmail.com, Jul 16, 2007.

  1. Guest

    Hi,

    I am a beginner (or worse) at Perl.

    I need to find the longest line (record) in a file. The code below
    works neatly for small files.
    But when I read huge files (on the order of gigabytes), it is very
    slow.

    I need to write an output file with stuff like:
    Longest line is... occurring on line number...
    There are ... lines in the file

    The same file is crunched using C in about 30 milliseconds!
    The difference in run times between Perl/VBScript and C is
    significant.

    Could someone help me find the best way to make Perl process huge
    files like these?

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    my $prev = -1;
    my $curr = 0;

    my ($sec, $min, $hour, $com) = localtime(time);
    print "Start time - $hour:$min:$sec \n";

    open(F1, "c:\\perl\\syd\\del.txt");

    while (<F1>)
    {
        $curr = index($_, "\x0A");
        if ($curr > $prev)
        {
            $prev = $curr;
        }
    }
    close(F1);

    my ($sec, $min, $hour, $com) = localtime(time);
    print "End time - $hour:$min:$sec \n";
    print "Lengthiest record length: $prev \n";

    The output times for a 1 GB file are:
    Start time - 20:32:31
    End time - 20:34:28
    Lengthiest record length: 460

    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    I am running this on a Windows XP laptop with a 1.7 GHz processor and
    1 GB of RAM.
    I am using ActivePerl

    Thanks in advance!
    Syd
     
    sydches@gmail.com, Jul 16, 2007
    #1

  2. Paul Lalli Guest

    On Jul 16, 7:40 am, sydches@gmail.com wrote:
    > I am a beginner (or worse) at Perl.
    >
    > I need to find the longest line (record) in a file. The code below
    > works neatly for small files.
    > But when I read huge files (on the order of gigabytes), it is very
    > slow.


    > Could someone help me find the best way to make Perl process huge
    > files like these?
    >
    > XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    > my $prev = -1;
    > my $curr = 0;
    >
    > my ($sec, $min, $hour, $com) = localtime(time);
    > print "Start time - $hour:$min:$sec \n";
    >
    > open(F1, "c:\\perl\\syd\\del.txt");
    >
    > while (<F1>)
    > {
    >     $curr = index($_, "\x0A");


    Well, here's one improvement you could make. Don't force Perl to
    search through each string looking for a specific character. Just ask
    it what the length of the string is. In my tests, that's about 10%
    faster:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw/:all/;

    sub use_index {
        open my $fh, '<', 'ipsum.txt' or die $!;
        my $prev = 0;
        while (<$fh>) {
            my $cur = index($_, "\x0A");
            if ($cur > $prev) {
                $prev = $cur;
            }
        }
    }

    sub use_length {
        open my $fh, '<', 'ipsum.txt' or die $!;
        my $prev = 0;
        while (<$fh>) {
            my $cur = length;
            if ($cur > $prev) {
                $prev = $cur;
            }
        }
    }

    cmpthese(timethese(100_000, { length => \&use_length, index => \&use_index }));
    __END__

    Benchmark: timing 100000 iterations of index, length...
         index: 26 wallclock secs (19.81 usr +  6.27 sys = 26.08 CPU) @ 3834.36/s (n=100000)
        length: 24 wallclock secs (17.10 usr +  6.47 sys = 23.57 CPU) @ 4242.68/s (n=100000)
              Rate  index length
    index   3834/s     --   -10%
    length  4243/s    11%     --



    Paul Lalli
     
    Paul Lalli, Jul 16, 2007
    #2

  3. Guest

    Hi Paul,

    Thank you for your suggestion. This does speed up things quite a bit.

    Is there any other way to speed this up even more? It is still slow,
    and my PC hangs on me with large files.

    Warm regards!
    Syd
     
    sydches@gmail.com, Jul 16, 2007
    #3
  4. J. Gleixner Guest

    sydches@gmail.com wrote:
    > Hi Paul,
    >
    > Thank you for your suggestion. This does speed up things quite a bit.
    >
    > Is there any other way to speed this up even more?


    Write it in C.

    > It is still slow,
    > and my PC hangs on me with large files.


    It shouldn't 'hang' your PC. It might use a large percentage of
    the CPU though.
     
    J. Gleixner, Jul 16, 2007
    #4
  5. Guest

    "" <> wrote:
    > Hi,
    >
    > I am a beginner (or worse) at Perl.
    >
    > I need to find the longest line (record) in a file. The code below
    > works neatly for small files.
    > But when I read huge files (on the order of gigabytes), it is very
    > slow.
    >
    > I need to write an output file with stuff like:
    > Longest line is... occurring on line number...
    > There are ... lines in the file
    >
    > The same file is crunched using C in about 30 milliseconds!


    I can't get anywhere near that speed in C. Can you post your C code,
    and some Perl code that generates a sample file to be operated on?


    > The difference in run times between Perl/VBScript and C is
    > significant.
    >
    > Could someone help me find the best way to make Perl process huge
    > files like these?


    Since you already have a C program which works, use it.


    >
    > XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    > my $prev = -1;
    > my $curr = 0;
    >
    > my ($sec, $min, $hour, $com) = localtime(time);
    > print "Start time - $hour:$min:$sec \n";
    >
    > open(F1, "c:\\perl\\syd\\del.txt");
    >
    > while (<F1>)
    > {
    >     $curr = index($_, "\x0A");


    You are searching for the end-of-line marker twice: once implicitly in
    the readline (<F1>) and once here. And you already know where it will
    be found the second time--either at the end, or nowhere.
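
    In other words, you can get the record length without scanning the
    string at all. Something like this (an untested sketch; the filename
    is assumed from your post):

    open my $fh, '<', 'c:/perl/syd/del.txt' or die $!;
    my $prev = -1;
    while (<$fh>) {
        # The newline, if present, can only be the last character, so the
        # record length is just the string length minus that newline.
        my $cur = length($_);
        $cur-- if substr($_, -1) eq "\x0A";
        $prev = $cur if $cur > $prev;
    }
    close $fh;
    print "Lengthiest record length: $prev \n";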

    Xho

     
    Xho, Jul 16, 2007
    #5
  6. Peter J. Holzer Guest

    On 2007-07-16 11:40, sydches@gmail.com wrote:
    > I need to find the longest line (record) in a file. The code below
    > works neatly for small files.
    > But when I read huge files (on the order of gigabytes), it is very
    > slow.
    >
    > I need to write an output file with stuff like:
    > Longest line is... occurring on line number...
    > There are ... lines in the file
    >
    > The same file is crunched using C in about 30 milliseconds!

    [...]
    > I am running this on a Windows XP laptop with a 1.7 GHz processor and
    > 1 GB of RAM.


    I don't believe that. Since you have only 1 GB RAM, you can't keep a
    1 GB file completely in memory. And you can't read a 1 GB file from disk
    in 30 milliseconds - certainly not from a laptop hard disk (30 seconds
    sounds more likely). Even if the whole file is cached in RAM I think
    that you can't scan 1GB of RAM in 30 ms (The new Power6 CPU claims a
    *maximum* memory read bandwidth of 40 GB/s - theoretically enough to
    scan 1 GB in 25 ms, but I doubt you get even close to that number in
    practice). My best attempt takes about 2 seconds user time (1.85 GHz
    Core2). I won't be surprised if somebody can improve this by an order of
    magnitude, but anything more requires serious magic.

    Just for comparison. Your script takes about 20.5 seconds on my system.
    The obvious optimization (using length instead of index) brings it down
    to 19.3 seconds. A naive portable C version (using stdio) is about as
    fast as your script (21.0 seconds), and a naive C version using mmap
    and strtok is much slower (37.4 seconds), but very much reduces CPU
    time. I guess by combining low level I/O calls (maybe even async I/O)
    and strtok I could get close to 15 seconds, which should be just about
    possible with the disk I have.
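
    One way to get into that region is to skip the line-by-line readline
    entirely and scan fixed-size blocks. Roughly like this (an untested
    sketch, not my actual code; the 1 MB block size and the filename are
    arbitrary choices):

    open my $fh, '<', 'del.txt' or die "open: $!";
    binmode $fh;

    my ($max, $carry) = (0, 0);
    while (my $n = sysread $fh, my $buf, 1 << 20) {
        my $pos = 0;
        while ((my $nl = index $buf, "\x0A", $pos) >= 0) {
            my $len = $carry + $nl - $pos;   # a line just ended here
            $max = $len if $len > $max;
            $carry = 0;
            $pos = $nl + 1;
        }
        $carry += $n - $pos;    # partial line continues in the next block
    }
    $max = $carry if $carry > $max;   # the last line may lack a newline
    close $fh;
    print "longest: $max\n";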

    hp

    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
     
    Peter J. Holzer, Jul 16, 2007
    #6
  7. Guest

    Hi,

    My apologies. The times were off the mark. It takes less than 3
    seconds (277 milliseconds).
    The start time is: 14:04:35.97
    The end time is: 14:04:38.74

    Here's the source code in C.
    ***************************************Begin***************************************
    # include <stdio.h>
    # include <conio.h>
    # include <fstream.h>
    # include <time.h>

    //-------------------------------------------------------------------------//
    // This program reads a file - terminated by a carriage return - and       //
    // reports the length of the longest record in the file.                   //
    //-------------------------------------------------------------------------//

    int main ( int argc, char *argv[] );
    void handle ( char input_file_name[], int *wide_line_width,
                  int *wide_line_number );
    void timestamp ( void );

    int main ( int argc, char *argv[] )
    {
        int i;
        char input_file_name[80];
        int wide_line_number;
        int wide_line_width;

        clrscr();

        textattr(6 + ((1) << 5));
        highvideo();
        cprintf("\n\n\n\n");
        cprintf(" ┌────────────────────────────────────────────────┐ ");
        cprintf(" │  CHECK THE MAX RECORD LENGTH IN A DOS/PC FILE  │ ");
        cprintf(" └────────────────────────────────────────────────┘ ");
        printf("\n");

        if ( argc < 2 )
        {
            cout << " Enter the input file name:\n";
            cout << "\n ";
            cin.getline ( input_file_name, sizeof ( input_file_name ) );

            cout << "\n";
            cout << " Started - ";
            timestamp ( );

            handle ( input_file_name, &wide_line_width, &wide_line_number );

            cout << "\n";
            cout << " The longest line of \"" << input_file_name
                 << "\" has length " << wide_line_width;
        }
        else
        {
            for ( i = 1 ; i < argc ; ++i )
            {
                handle ( argv[i], &wide_line_width, &wide_line_number );

                cout << " The longest line of \"" << argv[i]
                     << "\" has length " << wide_line_width;
            }
        }
        cout << "\n";
        cout << " Ended - ";
        timestamp ( );

        textattr(6 + ((1) << 5));
        highvideo();
        cprintf("\n\n\n\n");
        cprintf(" ┌────────────────────────────────────────────────┐ ");
        cprintf(" │             WRITTEN BY VINAY MAKAM             │ ");
        cprintf(" └────────────────────────────────────────────────┘ ");

        getchar();
        return 0;
    }
    //-------------------------------------------------------------------------//
    void handle ( char input_file_name[], int *wide_line_width,
                  int *wide_line_number )
    {
        int big_number;
        int big_width;
        char c;
        ifstream input_file;
        int line_number;
        int line_width;

        big_width = -1;
        big_number = -1;

        input_file.open( input_file_name );

        if ( !input_file )
        {
            cout << "\n";
            cout << "Fatal error!\n";
            cout << "  Cannot open the input file " << input_file_name << ".\n";
            return;
        }

        big_width = 0;
        line_width = 0;
        line_number = 0;

        while ( 1 )
        {
            input_file.get ( c );

            if ( input_file.eof ( ) )
            {
                break;
            }

            if ( c == '\n' )
            {
                line_number = line_number + 1;

                if ( big_width < line_width )
                {
                    big_width = line_width;
                    big_number = line_number;
                }
                line_width = 0;
            }
            else
            {
                line_width = line_width + 1;
            }
        }

        input_file.close ( );

        *wide_line_width = big_width;
        *wide_line_number = big_number;

        return;
    }
    //-------------------------------------------------------------------------//
    void timestamp ( void )
    {
    #define TIME_SIZE 40

        static char time_buffer[TIME_SIZE];
        const struct tm *tm;
        size_t len;
        time_t now;

        now = time ( NULL );
        tm = localtime ( &now );

        len = strftime ( time_buffer, TIME_SIZE, " %I:%M:%S %p", tm );
        len = len;  // keep the compiler quiet about the unused value
        cout << time_buffer << "\n";
        return;
    #undef TIME_SIZE
    }
    ****************************************End****************************************

    I am generating test files by iteratively copying a reasonably large
    file in DOS; each pass doubles the size:
    copy TestFile + TestFile TestFileDoubled

    Thank you very much for all the suggestions!
    Syd
     
    sydches@gmail.com, Jul 17, 2007
    #7
  8. Mirco Wahab Guest

    sydches@gmail.com wrote:
    > Hi,
    >
    > My apologies. The times were off the mark. It takes less than 3
    > seconds (277 milliseconds).
    > The start time is: 14:04:35.97
    > The end time is: 14:04:38.74
    >
    > Here's the source code in C.


    [...]

    > while ( 1 )
    > {
    >     input_file.get ( c );
    >
    >     if ( input_file.eof ( ) )
    >     {
    >         break;
    >     }
    >
    >     if ( c == '\n' )
    >     {
    >         line_number = line_number + 1;
    >
    >         if ( big_width < line_width )
    >         {
    >             big_width = line_width;
    >             big_number = line_number;
    >         }
    >         line_width = 0;
    >     }
    >     else
    >     {
    >         line_width = line_width + 1;
    >     }
    > }
    >
    > input_file.close ( );



    This is entirely impossible. I guess your
    C++ "test situation" doesn't touch the 1 GB
    file at all. Your 0.3 sec or 3.0 sec is just
    the time needed to load the application
    into RAM - and then it terminates right
    after startup. That's it (possibly).

    The perl solution *does* obviously check
    each line and returns the expected result.

    My fast-hacked C solution reads a 1 GB file
    in ~28 sec (2M lines, mean length 500 bytes)
    on an Athlon64/3200, 1 GB, WinXP.

    my 0.02 €

    Regards

    M.
     
    Mirco Wahab, Jul 17, 2007
    #8
  9. Mirco Wahab Guest

    sydches@gmail.com wrote:
    > Hi,
    >
    > I am a beginner (or worse) at Perl.
    >
    > I need to find the longest line (record) in a file. The code below
    > works neatly for small files.
    > But when I read huge files (on the order of gigabytes), it is very
    > slow.
    >
    > I am running this on a Windows XP laptop with a 1.7 GHz processor and
    > 1 GB of RAM.
    > I am using ActivePerl


    I did some tests on a Linux machine (Athlon XP/2500+, 1 GB)
    to clear this up.

    First, I put the stuff on the good ole Maxtor server
    drive, generated the 1 GB file, and ran the Perl program:

    winner: 1002 at 795

    real 0m31.034s
    user 0m7.080s
    sys 0m2.350s

    The Perl process itself needed 7 seconds, plus 2 seconds of
    operating-system file handling. The difference up to the
    total of 31 seconds is the time needed to get the file
    off the raw disk drive (a 1 GB file won't stay buffered).


    Next, I moved the directory to a new WD server drive, generated
    the 1 GB file, and ran the Perl program:

    winner: 1002 at 200

    real 0m15.603s
    user 0m7.290s
    sys 0m2.030s


    What we see here: the total time has been halved,
    but the time Perl itself needed is almost exactly the
    same. What we really measured is the two disks' bandwidths
    (roughly 1 GB/31 s ≈ 33 MB/s versus 1 GB/15.6 s ≈ 66 MB/s).
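
    To time just the disk, one could read raw blocks and ignore the
    line structure completely (a sketch, untested; the 1 MB chunk size
    is an arbitrary choice):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    open my $fh, '<', 'del.txt' or die "can't read $!";
    binmode $fh;

    my $t0    = time;
    my $bytes = 0;
    while (my $n = sysread $fh, my $buf, 1 << 20) {   # 1 MB chunks
        $bytes += $n;                                 # count, don't parse
    }
    close $fh;
    printf "%.0f MB/s\n", $bytes / (time - $t0) / 1e6;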


    Regards

    M.

    Perl source used:

    ==>
    use strict;
    use warnings;

    my ($l, $n) = (-1, -1);

    open my $fh, '<', 'del.txt' or die "can't do anything $!";
    while ( <$fh> ) {
        ($l, $n) = (length, $.) if $l < length;
    }
    close $fh;

    print "winner: $l at $n\n";
    <==
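
    If you also want the total line count the original post asked for,
    note that close() resets $., so grab it first (sketch):

    my $lines = $.;   # $. is the current input line number; close() resets it
    close $fh;
    print "winner: $l at $n, $lines lines total\n";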
     
    Mirco Wahab, Jul 17, 2007
    #9
  10. Guest

    Hi,

    I think it's only fair that we do this on the same file.

    ***************************************Begin***************************************
    open(F1, ">c:/perl/testout/HugeFile.txt");

    for ($index = 0; $index <= 1000000; $index++)
    {
        print F1 "This line is 32 characters long \n";
        print F1 "This line is 101 characters long \n";
    }
    close F1;
    ****************************************End****************************************

    The C code times are as follows:
    The current time is: 17:30:47.20
    The current time is: 17:30:55.00

    Thanks!
    Sydney
     
    sydches@gmail.com, Jul 17, 2007
    #10
  11. Mirco Wahab Guest

    sydches@gmail.com wrote:
    > Hi,
    >
    > I think it's only fair that we do this on the same file.
    > open(F1, ">c:/perl/testout/HugeFile.txt");
    >
    > for ($index = 0; $index <= 1000000; $index++)
    > {
    >     print F1 "This line is 32 characters long \n";
    >     print F1 "This line is 101 characters long \n";
    > }
    > close F1;


    Your file size will be

    1000000 * (32 + 101) ==> 133000000

    which is almost 128 MB or 0.128 GB

    Try:

    ==>

    use strict;
    use warnings;

    my $count = 10_000_000;

    open my $fh, '>', 'del.txt' or die "can't write $!";

    print $fh (
    'This line is 32 characters long
    ..................................... This line is 101 characters long ..............................
    ' ) while $count--;

    close $fh;

    <==

    This will result in a file of 128GB. Post your C results then.

    Regards

    M.
     
    Mirco Wahab, Jul 17, 2007
    #11
  12. Mirco Wahab Guest

    Mirco Wahab wrote:
    >
    > This will result in a file of 128GB. Post your C results then.


    OOPS, lost the dot - it should read:

    "This will result in a file of 1.28 GB"

    Sorry, M.
     
    Mirco Wahab, Jul 17, 2007
    #12
  13. Mumia W. Guest

    On 07/17/2007 03:47 AM, sydches@gmail.com wrote:
    > Hi,
    >
    > My apologies. The times were off the mark. It takes less than 3
    > seconds (277 milliseconds).
    > The start time is: 14:04:35.97
    > The end time is: 14:04:38.74
    > [...]


    No, this is not 277 milliseconds; it's 2.77 seconds or 2770 milliseconds.

    > Here's the source code in C. [...]


    No, it's C++.

    > # include <fstream.h>
    > [...]


    You must have a fast machine. Either that or your program is buggy.

    I have a 1300 MHz AMD CPU with 512 MB. I wrote both a C and a Perl version
    of this program, and this is what I got:

    > $ ls -ln ~/tmp/junk/big
    > -rw-r--r-- 1 **** **** 2088763392 2007-07-17 06:33 /home/****/tmp/junk/big
    > $
    > $ cat count-lines.c
    >
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <time.h>
    > #include <string.h>
    >
    > int main (int argc, const char ** argv)
    > {
    >     const char * filename = 0;
    >     FILE * handle = 0;
    >     time_t starttime = 0;
    >     time_t endtime = 0;
    >     long line_number = 0;
    >     long line_length = 0;
    >     long lnno = 0;
    >     static char line [10000];
    >
    >     if (argc < 2) {
    >         fprintf(stderr, "No filename\n");
    >         return EXIT_FAILURE;
    >     }
    >
    >     filename = argv[1];
    >     handle = fopen(filename, "r");
    >     if (0 == handle) {
    >         perror(filename);
    >         return EXIT_FAILURE;
    >     }
    >
    >     starttime = time(0);
    >
    >     while (fgets(line, sizeof(line), handle)) {
    >         lnno++;
    >         long length = strlen(line);
    >         if (length > line_length) {
    >             line_length = length;
    >             line_number = lnno;
    >         }
    >     }
    >     fclose(handle);
    >
    >     endtime = time(0);
    >
    >     printf("%ld is the longest line with %ld characters\n",
    >            line_number, line_length);
    >     printf("%ld seconds elapsed\n", (long) (endtime-starttime));
    >
    >     return EXIT_SUCCESS;
    > }
    >
    >
    > $ cat count-lines.pl
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > my ($line_number, $line_length) = (0,0);
    > my $starttime = time();
    >
    > while (<>) {
    >     my $length = length($_);
    >     if ($length > $line_length) {
    >         $line_length = $length;
    >         $line_number = $.;
    >     }
    > }
    > close(ARGV);
    >
    > my $endtime = time();
    >
    > print "$line_number is the longest line with $line_length characters\n";
    > printf "%d seconds elapsed\n", ($endtime-$starttime);
    >
    > $
    > $
    > $ ./count-lines ~/tmp/junk/big
    > 9580 is the longest line with 3836 characters
    > 51 seconds elapsed
    > $
    > $ ./count-lines.pl ~/tmp/junk/big
    > 9580 is the longest line with 3836 characters
    > 106 seconds elapsed
    > $
    > $ # For a bytecode-compiled scripting language, that's pretty damn good!
    > $


    I expected Perl to take ten to twenty times longer than C. I'm amazed
    that it's only about twice as slow. The fact that Perl can almost keep
    up with C means that Perl is ultra-efficient with character processing :-D

    However, your time of 2.77 seconds stretches my belief muscles too far.
    What kind of machine are you running on?

    PS.
    I was using the ext3 filesystem during the test. I could probably get much
    better results by using ext2 if I were willing to forgo filesystem
    journaling--which I'm not.
     
    Mumia W., Jul 17, 2007
    #13
  14. Mumia W. Guest

    On 07/17/2007 07:27 AM, Mumia W. wrote:
    > On 07/17/2007 03:47 AM, wrote:
    >> Hi, [ program snipped ]
    >>

    >
    > However, your time of 2.77 seconds stretches my belief muscles too far.
    > What kind of machine are you running on?
    > [...]


    Sorry about that. Of course your data is not my data, and of course some
    people will have machines that are 100 times faster than mine.
     
    Mumia W., Jul 17, 2007
    #14
  15. Mumia W. Guest

    On 07/17/2007 07:19 AM, Mirco Wahab wrote:
    > sydches@gmail.com wrote:
    >> Hi,
    >>
    >> I think it's only fair that we do this on the same file.
    >> open(F1, ">c:/perl/testout/HugeFile.txt");
    >>
    >> for ($index = 0; $index <= 1000000; $index++)
    >> {
    >>     print F1 "This line is 32 characters long \n";
    >>     print F1 "This line is 101 characters long \n";
    >> }
    >> close F1;

    >
    > Your file size will be
    >
    > 1000000 * (32 + 101) ==> 133000000
    >
    > which is almost 128 MB or 0.128 GB
    >
    > Try:
    >
    > ==>
    >
    > use strict;
    > use warnings;
    >
    > my $count = 10_000_000;
    >
    > open my $fh, '>', 'del.txt' or die "can't write $!";
    >
    > print $fh (
    > 'This line is 32 characters long
    > ..................................... This line is 101 characters long ..............................
    > ' ) while $count--;
    >
    > close $fh;
    >
    > <==
    >
    > This will result in a file of 128GB. Post your C results then.
    >
    > Regards
    >
    > M.


    My timing for the C program is similar to yours (same data with a
    different program).

    > $ (cd ~/tmp/junk ; ls -ln del.txt)
    > -rw-r--r-- 1 **** **** 1340000000 2007-07-17 07:55 del.txt
    > $
    > $ ./count-lines ~/tmp/junk/del.txt
    > 2 is the longest line with 102 characters
    > 26 seconds elapsed
    > $
    > $ ./count-lines.pl ~/tmp/junk/del.txt
    > 2 is the longest line with 102 characters
    > 38 seconds elapsed
    > $


    "Count-lines" is the C program's binary, and count-lines.pl is,
    obviously, the Perl program.

    I'm still impressed by Perl's speed.

    Yes, I know the wording of my program's output needs work. Line 2 is
    only one of the 10 million longest lines in the file ;-)
     
    Mumia W., Jul 17, 2007
    #15
  16. Guest

    Hi,

    Using the Perl code that Mirco gave, I created a file which is 1.27 GB
    in size.

    Ran the C++ code and the times are:
    The current time is: 19:18:52.21
    The current time is: 19:23:09.65

    I am on a Windows XP system (1.7 GHz processor, 1 GB of RAM), and
    using ActivePerl.

    Thanks!
    Syd
     
    sydches@gmail.com, Jul 17, 2007
    #16
  17. Mirco Wahab Guest

    sydches@gmail.com wrote:
    > Using the Perl code that Mirco gave, I created a file which is 1.27 GB
    > in size.
    > Ran the C++ code and the times are:
    > The current time is: 19:18:52.21
    > The current time is: 19:23:09.65


    OK, that is roughly 257 seconds, which is what
    one may expect from a loaded WinXP
    machine at 1.7 GHz.

    I did another test with the very same file
    (1.2 GB) on an old Unix machine (Athlon XP/2500+,
    running from a WD3200JB with ext3).

    After compiling your C++ program (gcc 4.1, commenting out
    the non-Unixish stuff) with: g++ -O3 -o sydches sydches.cxx

    I see the following results:
    $> time ./sydches del.txt

    [...]

    real 1m1.888s
    user 0m54.270s
    sys 0m3.070s

    (the whole process takes ~62sec) - whereas the short Perl
    script provided in another post shows the following:
    $> time perl longest.pl

    [...]

    real 0m28.218s
    user 0m21.140s
    sys 0m3.330s

    (which is more than twice as fast). So a Perl program like this:

    ...
    open my $fh, '<', 'del.txt' or die "can't do anything $!";
    while ( <$fh> ) {
        ($l, $n) = (length, $.) if $l < length;
    }
    close $fh;
    ...

    may therefore, as one can see, be much,
    much faster than a 'non-optimally' written
    C/C++ program.

    Regards

    Mirco
     
    Mirco Wahab, Jul 17, 2007
    #17
  18. Guest

    Hi,

    I started running both my programs (C++ and Perl) on a whole lot of
    test files, and I noticed something.

    The C code and the Perl code run in almost the same time!
    Except for a single 1 GB file, which the C code does in around 250
    milliseconds while Perl takes about 2-3 minutes!
    The output says 460 is the longest record.

    Unfortunately, I am unable to open this file directly (because of its
    size).
    I am going to try a splitter to see what kind of data is in this one
    file.
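
    Maybe I can at least peek at the first bytes without loading the
    whole thing - something like this, perhaps (untested; path as in my
    first post):

    open my $fh, '<', 'c:/perl/syd/del.txt' or die $!;
    binmode $fh;
    read $fh, my $head, 4096;                 # first 4 KB only
    printf "%v02x\n", substr($head, 0, 32);   # hex of the first 32 bytes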

    Thanks for all the help. Boy, did I learn a whole lot talking to you
    guys!

    Warm regards!
    Syd
     
    sydches@gmail.com, Jul 18, 2007
    #18
  19. Guest

    "" <> wrote:
    > Hi,
    >
    > I started running both my programs (C++ and Perl) on a whole lot of
    > test files, and I noticed something.
    >
    > The C code and the Perl code run in almost the same time!
    > Except for a single 1 GB file, which the C code does in around 250
    > milliseconds while Perl takes about 2-3 minutes!
    > The output says 460 is the longest record.
    >
    > Unfortunately, I am unable to open this file directly (because of its
    > size).
    > I am going to try a splitter to see what kind of data is in this one
    > file.


    I wonder if Perl is somehow deciding that the file is in Unicode rather
    than simple one-byte characters. I understand that would slow things
    down considerably. I don't know how Perl would make that decision; I
    have little experience in that area.
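
    If you want to check, asking PerlIO which layers the handle got should
    show whether any translation layer is involved, and binmode forces
    plain bytes (a sketch, untested against your file; path as in the
    first post):

    open my $fh, '<', 'c:/perl/syd/del.txt' or die $!;
    print join(' ', PerlIO::get_layers($fh)), "\n";  # e.g. "unix crlf" on Windows
    binmode $fh;   # drop any translation layer and read raw bytes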


    Xho

     
    Xho, Jul 18, 2007
    #19
