Handling large files (a few GB) in Perl


sydches

Hi,

I am a beginner (or worse) at Perl.

I have a need to find the longest line (record) in a file. The below
code works neatly for small files.
But when I need to read huge files (in the order of Gb), it is very
slow.

I need to write an output file with stuff like:
Longest line is... occurring on line number...
There are ... lines in the file

The same file is crunched using C in about 30 milliseconds!
The difference in run times of Perl/VbScript and C is a significant
one.

Could someone help me in finding what way I could make Perl work the
best way for processing huge files such as these?

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
my $prev=-1;
my $curr=0;

my ($sec,$min,$hour,$com) = localtime(time);
print "Start time - $hour:$min:$sec \n";

open(F1, "c:\\perl\\syd\\del.txt");

while (<F1>)
{
    $curr = index($_, "\x0A");
    if ($curr > $prev)
    {
        $prev = $curr;
    }
}
close(F1);

my ($sec,$min,$hour,$com) = localtime(time);
print "End time - $hour:$min:$sec \n";
print "Lengthiest record length: $prev \n";

The output times for a 1 GB file are:
Start time - 20:32:31
End time - 20:34:28
Lengthiest record length: 460

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I am running this on a laptop with Windows XP (1.7 GHz processor, 1 GB
of RAM). I am using ActivePerl.

Thanks in advance!
Syd
 

Paul Lalli

I am a beginner (or worse) at Perl.

I have a need to find the longest line (record) in a file. The below
code works neatly for small files.
But when I need to read huge files (in the order of Gb), it is very
slow.
Could someone help me in finding what way I could make Perl work the
best way for processing huge files such as these?

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
my $prev=-1;
my $curr=0;

my ($sec,$min,$hour,$com) = localtime(time);
print "Start time - $hour:$min:$sec \n";

open(F1, "c:\\perl\\syd\\del.txt");

while (<F1>)
{
    $curr = index($_, "\x0A");

Well, here's one improvement you could make. Don't force Perl to
search through each string looking for a specific character. Just ask
it what the length of the string is. In my tests, that's about 10%
faster:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw/:all/;

sub use_index {
    open my $fh, '<', 'ipsum.txt' or die $!;
    my $prev = 0;
    while (<$fh>) {
        my $cur = index($_, "\x0A");
        if ($cur > $prev) {
            $prev = $cur;
        }
    }
}

sub use_length {
    open my $fh, '<', 'ipsum.txt' or die $!;
    my $prev = 0;
    while (<$fh>) {
        my $cur = length;
        if ($cur > $prev) {
            $prev = $cur;
        }
    }
}

cmpthese(timethese(100_000, {
    length => \&use_length,
    index  => \&use_index,
}));
__END__

Benchmark: timing 100000 iterations of index, length...
 index: 26 wallclock secs (19.81 usr + 6.27 sys = 26.08 CPU) @ 3834.36/s (n=100000)
length: 24 wallclock secs (17.10 usr + 6.47 sys = 23.57 CPU) @ 4242.68/s (n=100000)
          Rate  index length
index   3834/s     --   -10%
length  4243/s    11%     --



Paul Lalli
 

sydches

Hi Paul,

Thank you for your suggestion. This does speed up things quite a bit.

Is there any other way to speed this up further? It is still slow,
and my PC hangs on me for large files.

Warm regards!
Syd
 

J. Gleixner

Hi Paul,

Thank you for your suggestion. This does speed up things quite a bit.

Is there any other way to speed this up further?

Write it in C.
It is still slow,
and my PC hangs on me for large files.

It shouldn't 'hang' your PC. It might use a large percentage of
the CPU though.
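
For what it's worth, one way to get C speed without leaving Perl is the
Inline::C module. A minimal sketch, assuming Inline::C is installed and
a C compiler is on the PATH (the default filename is just a placeholder):

#!/usr/bin/perl
use strict;
use warnings;

# Compile a small C scanner at startup; Inline caches the compiled code.
use Inline C => <<'END_C';
/* Scan a file byte by byte and return the longest line length. */
long longest_line(char *path) {
    FILE *fh = fopen(path, "rb");
    long best = 0, cur = 0;
    int c;
    if (!fh) return -1;
    while ((c = getc(fh)) != EOF) {
        if (c == '\n') { if (cur > best) best = cur; cur = 0; }
        else           { cur++; }
    }
    if (cur > best) best = cur;   /* final line without a newline */
    fclose(fh);
    return best;
}
END_C

my $file = @ARGV ? $ARGV[0] : 'del.txt';
print "Longest line: ", longest_line($file), "\n";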
 

xhoster

Hi,

I am a beginner (or worse) at Perl.

I have a need to find the longest line (record) in a file. The below
code works neatly for small files.
But when I need to read huge files (in the order of Gb), it is very
slow.

I need to write an output file with stuff like:
Longest line is... occurring on line number...
There are ... lines in the file

The same file is crunched using C in about 30 milliseconds!

I can't get anywhere near that speed in C. Can you post your C code,
and some Perl code that generates a sample file to be operated on?

The difference in run times of Perl/VbScript and C is a significant
one.

Could someone help me in finding what way I could make Perl work the
best way for processing huge files such as these?

Since you already have a C program which works, use it.

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
my $prev=-1;
my $curr=0;

my ($sec,$min,$hour,$com) = localtime(time);
print "Start time - $hour:$min:$sec \n";

open(F1, "c:\\perl\\syd\\del.txt");

while (<F1>)
{
    $curr = index($_, "\x0A");

You are searching for the end-of-line marker twice: once implicitly in
the readline (<F1>) and once here. And you already know where it will be
found the second time--either at the end, or nowhere.
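
A minimal sketch of the loop without that redundant scan, using length
instead (chomp strips the newline first, so the number matches what
index() was reporting), and also tracking the line number and line
count the original post asked for:

#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', 'c:/perl/syd/del.txt' or die "can't open: $!";
my ($max, $at) = (0, 0);
while (<$fh>) {
    chomp;    # drop the trailing newline before measuring
    ($max, $at) = (length, $.) if length > $max;
}
my $lines = $.;    # grab the line count before close() resets $.
close $fh;

print "Longest line is $max chars, occurring on line number $at\n";
print "There are $lines lines in the file\n";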

Xho
 

Peter J. Holzer

I have a need to find the longest line (record) in a file. The below
code works neatly for small files.
But when I need to read huge files (in the order of Gb), it is very
slow.

I need to write an output file with stuff like:
Longest line is... occurring on line number...
There are ... lines in the file

The same file is crunched using C in about 30 milliseconds! [...]
I am running this on a laptop with Windows XP (1.7 GHz processor, 1 GB
of RAM).

I don't believe that. Since you have only 1 GB RAM, you can't keep a
1 GB file completely in memory. And you can't read a 1 GB file from disk
in 30 milliseconds - certainly not from a laptop hard disk (30 seconds
sounds more likely). Even if the whole file is cached in RAM I think
that you can't scan 1GB of RAM in 30 ms (The new Power6 CPU claims a
*maximum* memory read bandwidth of 40 GB/s - theoretically enough to
scan 1 GB in 25 ms, but I doubt you get even close to that number in
practice). My best attempt takes about 2 seconds user time (1.85 GHz
Core2). I won't be surprised if somebody can improve this by an order of
magnitude, but anything more requires serious magic.

Just for comparison. Your script takes about 20.5 seconds on my system.
The obvious optimization (using length instead of index) brings it down
to 19.3 seconds. A naive portable C version (using stdio) is about as
fast as your script (21.0 seconds), and a naive C version using mmap
and strtok is much slower (37.4 seconds), but very much reduces CPU
time. I guess by combining low level I/O calls (maybe even async I/O)
and strtok I could get close to 15 seconds, which should be just about
possible with the disk I have.
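
For the curious, a block-oriented attempt along those lines is possible
in Perl too: sysread into a large buffer and let index find the
newlines, so there is no per-line readline overhead. A sketch only,
with an arbitrary 1 MB buffer (the filename is a placeholder):

#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<:raw', 'del.txt' or die "can't open: $!";

my ($max_len, $max_line, $lines) = (0, 0, 0);
my $carry = 0;    # bytes of a partial line left over from the last chunk

while (sysread($fh, my $buf, 1 << 20)) {
    my $pos = 0;
    while ((my $nl = index($buf, "\x0A", $pos)) >= 0) {
        my $len = $carry + ($nl - $pos);
        $lines++;
        ($max_len, $max_line) = ($len, $lines) if $len > $max_len;
        $carry = 0;
        $pos   = $nl + 1;
    }
    $carry += length($buf) - $pos;    # tail with no newline yet
}
if ($carry) {    # final line without a trailing newline
    $lines++;
    ($max_len, $max_line) = ($carry, $lines) if $carry > $max_len;
}
close $fh;

print "Longest line is $max_len bytes, on line $max_line\n";
print "There are $lines lines in the file\n";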

hp
 

sydches

Hi,

My apologies. The times were off the mark. It takes less than 3
seconds (277 milliseconds).
The start time is: 14:04:35.97
The end time is: 14:04:38.74

Here's the source code in C.
***************************************Begin***************************************
# include <stdio.h>
# include <conio.h>
# include <fstream.h>
# include <time.h>

//-------------------------------------------------------------------------//
// This program reads a file - terminated by a carriage return - and      //
// reports the length of the longest record in the file.                  //
//-------------------------------------------------------------------------//

int main ( int argc, char *argv[] );
void handle ( char input_file_name[], int *wide_line_width,
              int *wide_line_number );
void timestamp ( void );

int main ( int argc, char *argv[] )
{
    int i;
    char input_file_name[80];
    int wide_line_number;
    int wide_line_width;

    clrscr();

    textattr(6 + ((1) << 5));
    highvideo();
    cprintf("\n\n\n\n");
    cprintf(" ┌────────────────────────────────────────────────────┐ ");
    cprintf(" │    CHECK THE MAX RECORD LENGTH IN A DOS/PC FILE    │ ");
    cprintf(" └────────────────────────────────────────────────────┘ ");
    printf("\n");

    if ( argc < 2 )
    {
        cout << " Enter the input file name:\n";
        cout << "\n ";
        cin.getline ( input_file_name, sizeof ( input_file_name ) );

        cout << "\n";
        cout << " Started - ";
        timestamp ( );

        handle ( input_file_name, &wide_line_width, &wide_line_number );

        cout << "\n";
        cout << " The longest line of \"" << input_file_name
             << "\" has length " << wide_line_width;
    }
    else
    {
        for ( i = 1 ; i < argc ; ++i )
        {
            handle ( argv[i], &wide_line_width, &wide_line_number );

            cout << " The longest line of \"" << argv[i]
                 << "\" has length " << wide_line_width;
        }
    }
    cout << "\n";
    cout << " Ended - ";
    timestamp ( );

    textattr(6 + ((1) << 5));
    highvideo();
    cprintf("\n\n\n\n");
    cprintf(" ┌────────────────────────────────────────────────────┐ ");
    cprintf(" │              WRITTEN BY VINAY MAKAM                │ ");
    cprintf(" └────────────────────────────────────────────────────┘ ");

    getchar();
    return 0;
}
//-------------------------------------------------------------------------//
void handle ( char input_file_name[], int *wide_line_width,
              int *wide_line_number )
{
    int big_number;
    int big_width;
    char c;
    ifstream input_file;
    int line_number;
    int line_width;

    big_width = -1;
    big_number = -1;

    input_file.open( input_file_name );

    if ( !input_file )
    {
        cout << "\n";
        cout << "Fatal error!\n";
        cout << " Cannot open the input file " << input_file_name << ".\n";
        return;
    }

    big_width = 0;
    line_width = 0;
    line_number = 0;

    while ( 1 )
    {
        input_file.get ( c );

        if ( input_file.eof ( ) )
        {
            break;
        }

        if ( c == '\n' )
        {
            line_number = line_number + 1;

            if ( big_width < line_width )
            {
                big_width = line_width;
                big_number = line_number;
            }
            line_width = 0;
        }
        else
        {
            line_width = line_width + 1;
        }
    }

    input_file.close ( );

    *wide_line_width = big_width;
    *wide_line_number = big_number;

    return;
}
//-------------------------------------------------------------------------//
void timestamp ()
{
#define TIME_SIZE 40

    static char time_buffer[TIME_SIZE];
    const struct tm *tm;
    size_t len;
    time_t now;

    now = time ( NULL );
    tm = localtime ( &now );

    len = strftime ( time_buffer, TIME_SIZE, " %I:%M:%S %p", tm );
    len = len;    /* self-assignment only silences the unused-result warning */
    cout << time_buffer << "\n";
    return;
#undef TIME_SIZE
}
****************************************End****************************************

I am generating test files by copying a reasonably large file
iteratively, in DOS:
copy TestFile + TestFile TestFileDoubled

Thank you very much for all the suggestions!
Syd
 

Mirco Wahab

Hi,

My apologies. The times were off the mark. It takes less than 3
seconds (277 milliseconds).
The start time is: 14:04:35.97
The end time is: 14:04:38.74

Here's the source code in C.
....

while ( 1 )
{
    input_file.get ( c );

    if ( input_file.eof ( ) )
    {
        break;
    }

    if ( c == '\n' )
    {
        line_number = line_number + 1;

        if ( big_width < line_width )
        {
            big_width = line_width;
            big_number = line_number;
        }
        line_width = 0;
    }
    else
    {
        line_width = line_width + 1;
    }
}

input_file.close ( );


This is entirely impossible. I guess your
C++ "test situation" doesn't touch the 1 GB
file at all. Your 0.3 sec or 3.0 sec is
the time needed to load the application
into RAM - and it terminates right after startup.
That's it (possibly).

The perl solution *does* obviously check
each line and returns the expected result.

My fast-hacked C solution reads a 1 GB file
in ~28 sec (2M lines, mean length 500 bytes)
on an Athlon64/3200, 1 GB RAM, WinXP.

my 0,02 €

Regards

M.
 

Mirco Wahab

Hi,

I am a beginner (or worse) at Perl.

I have a need to find the longest line (record) in a file. The below
code works neatly for small files.
But when I need to read huge files (in the order of Gb), it is very
slow.

I am running this on a laptop with Windows XP (1.7 GHz processor, 1 GB
of RAM). I am using ActivePerl.

I did some tests on a Linux machine (Athlon XP/2500+, 1 GB)
to clear this up.

First, I put the stuff on the good ole Maxtor server
drive, generated the 1 GB file, and ran the Perl program:
winner: 1002 at 795

real 0m31.034s
user 0m7.080s
sys 0m2.350s

The Perl process itself needed 7 seconds, plus 2 seconds
of operating-system file handling. The rest of the
31-second wall time is the time needed to get the file
from the raw disk drive (a 1 GB file won't stay buffered).


Next, I moved the directory to a new WD server drive,
generated the 1 GB file, and ran the Perl program:

winner: 1002 at 200

real 0m15.603s
user 0m7.290s
sys 0m2.030s


What we see here: the whole time has been halved,
but the time Perl itself needed is almost exactly the
same. We were really measuring the two disks' bandwidths.


Regards

M.

Perl source used:

==>
use strict;
use warnings;

my ($l, $n) = (-1, -1);

open my $fh, '<', 'del.txt' or die "can't do anything $!";
while( <$fh> ) {
    ($l, $n) = (length, $.) if $l < length
}
close $fh;

print "winner: $l at $n\n";
<==
 

sydches

Hi,

I think it's only fair that we do this on the same file.

***************************************Begin***************************************
open(F1, ">c:/perl/testout/HugeFile.txt");

for ($index = 0; $index <= 1000000; $index++)
{
    print F1 "This line is 32 characters long \n";
    print F1 "This line is 101 characters long \n";
}
close F1;
****************************************End****************************************

The C code times are as follows:
The current time is: 17:30:47.20
The current time is: 17:30:55.00

Thanks!
Sydney
 

Mirco Wahab

Hi,

I think it's only fair that we do this on the same file.
open(F1, ">c:/perl/testout/HugeFile.txt");

for ($index = 0; $index <= 1000000; $index++)
{
    print F1 "This line is 32 characters long \n";
    print F1 "This line is 101 characters long \n";
}
close F1;

Your file size will be

1000000 * (32 + 101) ==> 133000000

which is almost 128 MB or 0.128 GB

Try:

==>

use strict;
use warnings;

my $count = 10_000_000;

open my $fh, '>', 'del.txt' or die "can't write $!";

print $fh (
'This line is 32 characters long
..................................... This line is 101 characters long ..............................
' ) while $count--;

close $fh;

<==

This will result in a file of 128GB. Post your C results then.

Regards

M.
 

Mirco Wahab

Mirco said:
This will result in a file of 128GB. Post your C results then.

OOPS, lost the dot:

must read: "This will result in a file of 1.28GB"

Sorry, M.
 

Mumia W.

Hi,

My apologies. The times were off the mark. It takes less than 3
seconds (277 milliseconds).
The start time is: 14:04:35.97
The end time is: 14:04:38.74
[...]

No, this is not 277 milliseconds; it's 2.77 seconds or 2770 milliseconds.
Here's the source code in C. [...]

No, it's C++.
# include <fstream.h>
[...]

You must have a fast machine. Either that or your program is buggy.

I have a 1300 MHz AMD CPU with 512 MB of RAM. I wrote both a C and a
Perl version of this program, and this is what I got:
$ ls -ln ~/tmp/junk/big
-rw-r--r-- 1 **** **** 2088763392 2007-07-17 06:33 /home/****/tmp/junk/big
$
$ cat count-lines.c

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

int main (int argc, const char ** argv)
{
    const char * filename = 0;
    FILE * handle = 0;
    time_t starttime = 0;
    time_t endtime = 0;
    long line_number = 0;
    long line_length = 0;
    long lnno = 0;
    static char line [10000];

    if (argc < 2) {
        fprintf(stderr, "No filename\n");
        return EXIT_FAILURE;
    }

    filename = argv[1];
    handle = fopen(filename, "r");
    if (0 == handle) {
        perror(filename);
        return EXIT_FAILURE;
    }

    starttime = time(0);

    while (fgets(line, sizeof(line), handle)) {
        lnno++;
        long length = strlen(line);
        if (length > line_length) {
            line_length = length;
            line_number = lnno;
        }
    }
    fclose(handle);

    endtime = time(0);

    printf("%ld is the longest line with %ld characters\n",
           line_number, line_length);
    printf("%ld seconds elapsed\n", (long) (endtime - starttime));

    return EXIT_SUCCESS;
}


$ cat count-lines.pl
#!/usr/bin/perl
use strict;
use warnings;

my ($line_number, $line_length) = (0,0);
my $starttime = time();

while (<>) {
    my $length = length($_);
    if ($length > $line_length) {
        $line_length = $length;
        $line_number = $.;
    }
}
close(ARGV);

my $endtime = time();

print "$line_number is the longest line with $line_length characters\n";
printf "%d seconds elapsed\n", ($endtime-$starttime);

$
$
$ ./count-lines ~/tmp/junk/big
9580 is the longest line with 3836 characters
51 seconds elapsed
$
$ ./count-lines.pl ~/tmp/junk/big
9580 is the longest line with 3836 characters
106 seconds elapsed
$
$ # For a bytecode-compiled scripting language, that's pretty damn good!
$

I expected Perl to take ten to twenty times longer than C. I'm amazed
that it's only about twice as slow. The fact that Perl can almost keep
up with C means that Perl is ultra-efficient with character processing :-D

However, your time of 2.77 seconds stretches my belief muscles too far.
What kind of machine are you running on?

PS.
I was using the ext3 filesystem during the test. I can probably get much
better results by using ext2 if I'm willing to forgo filesystem
journaling--which I'm not willing to do.
 

Mumia W.

Hi, [ program snipped ]

However, your time of 2.77 seconds stretches my belief muscles too far.
What kind of machine are you running on?
[...]

Sorry about that. Of course your data is not my data, and of course some
people will have machines that are 100 times faster than mine.
 

Mumia W.

Your file size will be

1000000 * (32 + 101) ==> 133000000

which is almost 128 MB or 0.128 GB

Try:

==>

use strict;
use warnings;

my $count = 10_000_000;

open my $fh, '>', 'del.txt' or die "can't write $!";

print $fh (
'This line is 32 characters long
..................................... This line is 101 characters long
..............................
' ) while $count--;

close $fh;

<==

This will result in a file of 128GB. Post your C results then.

Regards

M.

My timing for the C program is similar to yours (same data with a
different program).
$ (cd ~/tmp/junk ; ls -ln del.txt)
-rw-r--r-- 1 **** **** 1340000000 2007-07-17 07:55 del.txt
$
$ ./count-lines ~/tmp/junk/del.txt
2 is the longest line with 102 characters
26 seconds elapsed
$
$ ./count-lines.pl ~/tmp/junk/del.txt
2 is the longest line with 102 characters
38 seconds elapsed
$

"Count-lines" is the C program's binary, and count-lines.pl is,
obviously, the Perl program.

I'm still impressed by Perl's speed.

Yes, I know the wording of my program's output needs work. Line 2 is
only one of the 10 million longest lines in the file ;-)
 

sydches

Hi,

Using the Perl code that Mirco gave, I created a file which is 1.27 GB
in size.

Ran the C++ code and the times are:
The current time is: 19:18:52.21
The current time is: 19:23:09.65

I am on a Windows XP system (1.7 GHz processor, 1 GB of RAM), using
ActivePerl.

Thanks!
Syd
 

Mirco Wahab

Using the Perl code that Mirco gave, I created a file which is 1.27 GB
in size.
Ran the C++ code and the times are:
The current time is: 19:18:52.21
The current time is: 19:23:09.65

OK, this is about 257 seconds, which is what
one may expect from a loaded Windows XP
machine at 1.7 GHz.

I did another test with the very same file
(1.2 GB) on an old Unix machine (Athlon XP/2500+,
reading from a WD3200JB, ext3).

After compiling your C++ program (gcc 4.1, commenting out the
non-Unix stuff) with: g++ -O3 -o sydches sydches.cxx

I see the following results:
$> time ./sydches del.txt

....

real 1m1.888s
user 0m54.270s
sys 0m3.070s

(the whole process takes ~62sec) - whereas the short Perl
script provided in another post shows the following:
$> time perl longest.pl

....

real 0m28.218s
user 0m21.140s
sys 0m3.330s

(which is more than twice as fast). So a Perl program like this:

...
open my $fh, '<', 'del.txt' or die "can't do anything $!";
while( <$fh> ) {
    ($l, $n) = (length, $.) if $l < length
}
close $fh;
...

can therefore, as one can see, be much
faster than a 'non-optimally' written C/C++ program - here most
likely because the C++ version reads one character at a time
through iostream's get().

Regards

Mirco
 

sydches

Hi,

I started running both my programs (C++ and Perl) on a whole lot of
test files. And I noticed something.

The C code and Perl code run in almost the same time!
Except for a single 1 GB file, which the C code does in around 250
milliseconds while Perl takes about 2-3 minutes!
The output says 460 is the longest record.

Unfortunately, I am unable to open this file directly (because of its
size).
I am going to try a splitter to see what kind of data is in this one
file.

Thanks for all the help. Boy, did I learn a whole lot talking to you
guys!

Warm regards!
Syd
 

xhoster

Hi,

I started running both my programs (C++ and Perl) on a whole lot of
test files. And I noticed something.

The C code and Perl code run in almost the same time!
Except for a single 1 GB file, which the C code does in around 250
milliseconds while Perl takes about 2-3 minutes!
The output says 460 is the longest record.

Unfortunately, I am unable to open this file directly (because of its
size).
I am going to try a splitter to see what kind of data is in this one
file.

I wonder if Perl is somehow deciding that that file is in Unicode rather
than simple one-byte characters. I understand that that will slow things
down considerably. I don't know how Perl would make that decision; I have
little experience on that topic.
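
If that is what's happening, forcing byte semantics should make the
difference disappear. A quick check, assuming the file is plain
single-byte data (note that on Windows :raw also disables CRLF
translation, so line lengths may come out one higher):

open my $fh, '<:raw', 'del.txt' or die $!;  # bypass any PerlIO encoding layers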


Xho
 
