Processing Multiple Large Files

Discussion in 'Perl Misc' started by friend.05@gmail.com, Dec 11, 2008.

  1. Guest

    Hi,

    I am analyzing some network log files. There are around 200-300 files,
    and each file has more than 2 million entries in it.

    Currently my script is reading each file line by line, so it takes a
    lot of time to process all the files.

    Is there any efficient way to do it?

    Maybe multiprocessing or multitasking?


    Thanks.
     
    , Dec 11, 2008
    #1

  2. Tim Greer Guest

    wrote:

    > Hi,
    >
    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.


    When dealing with a lot of data, you usually do want to read line by
    line. That's the most memory-efficient way to handle large text files.
    If you have a ton of memory to play with, you can try other approaches,
    but even reading line by line, there might be ways to speed things up,
    too, depending on a few variables and your needs.

    No matter how you go about it, if you have to look at every line in the
    file (to use, process, skip, whatever), you're still going to have to
    do that, and line-by-line reading will have the smaller memory
    footprint. Maybe it's how you're going about the task that can be
    improved? Do you have any relevant code snippets?
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
     
    Tim Greer, Dec 11, 2008
    #2

  3. Guest

    On Dec 11, 3:32 pm, Tim Greer <> wrote:
    > wrote:
    > > Hi,

    >
    > > I analyzing some netwokr log files. There are around 200-300 files and
    > > each file has more than 2 million entries in it.

    >
    > > Currently my script is reading each file line by line. So it will take
    > > lot of time to process all the files.

    >
    > When dealing with a lot of data, you usually want to read line by line,
    > if you can help it.  That's the most efficient way when dealing with
    > large text files.  If you have a ton of memory to play with, you can
    > try other solutions, but even reading line by line, there might be ways
    > to speed that up, too, depending on a few variables and your needs.
    >
    > No matter how you go about it, if you have to look at every line in the
    > file (to use, process, skip, whatever), you're still going to have to
    > do that and it will have the smaller memory footprint.  Maybe it's how
    > you're going about the task that can be improved?  Do you have any
    > relevant code snippets?
    > --
    > Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    > Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    > and Custom Hosting.  24/7 support, 30 day guarantee, secure servers.
    > Industry's most experienced staff! -- Web Hosting With Muscle!


    Yes, I am reading each file line by line.

    But there are more than 200 files, so is there a way I can process
    some of the files in parallel?

    Or is there any other solution to speed up my task?
     
    , Dec 11, 2008
    #3
  4. Guest

    "" <> wrote:
    > Hi,
    >
    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line.


    Perl makes it look like you are reading the files line by line,
    but really it is using internal buffering to read the files
    in larger chunks (well, if the lines are short; if the lines
    are long, the chunks may actually be shorter than the lines).

    > So it will take
    > lot of time to process all the files.
    >
    > Is there any efficient way to do it?


    Figure out which parts are inefficient, and improve them.

    > May be Multiprocessing, Multitasking ?


    Do you have several CPUs? Can your I/O system keep up with them?

    There are all kinds of ways to do parallel processing in Perl.
    In this case, maybe Parallel::ForkManager would be best:
    each process can be assigned a specific one of the 300
    files to work on.

    See the docs for Parallel::ForkManager.
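
    A minimal sketch of that idea (the '*.log' glob and the maximum of 4
    children are made up for illustration, not taken from the OP's setup):

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(4);   # at most 4 children at once

    for my $file (glob '*.log') {
        $pm->start and next;                  # parent: move on to the next file

        # child: process exactly one file, then exit
        open my $in, '<', $file or die "Cannot open '$file': $!";
        while (my $line = <$in>) {
            # per-line processing goes here
        }
        close $in;

        $pm->finish;
    }

    $pm->wait_all_children;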

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Dec 11, 2008
    #4
  5. "" <> writes:

    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.


    It depends on what kind of processing you're doing.

    If you don't need to process lines in order, you might get a speedup by
    starting a couple of processes, each processing its own files. Again,
    depending on the kind of processing, the optimal number of processes
    may vary from the number of CPUs to a couple of times the number of
    CPUs.

    If part of your processing consists of doing DNS lookups, you might be
    able to get a speedup by reading a few lines at a time and using
    asynchronous DNS requests (Net::DNS::Async seems to do it) instead of
    blocking on each and every request.
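
    Roughly along these lines, following the add/await interface in the
    module's docs (an untested sketch; the host names and record type are
    made up, and in practice they would come from the log lines):

    use strict;
    use warnings;
    use Net::DNS::Async;

    my $resolver = Net::DNS::Async->new(QueueSize => 20, Retries => 3);

    my @hosts = qw( www.example.com mail.example.com ns1.example.com );

    for my $host (@hosts) {
        # queue the query; the callback runs when an answer (or failure) arrives
        $resolver->add(\&handle_response, $host, 'A');
    }

    $resolver->await;    # block until every outstanding query has finished

    sub handle_response {
        my $response = shift;              # a Net::DNS::Packet, or undef
        return unless defined $response;
        print $_->string, "\n" for $response->answer;
    }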

    Other optimizations might be possible, but almost everything depends
    on the kind of processing you have to do and if you have to process
    lines in some predetermined order.

    //Makholm
     
    Peter Makholm, Dec 11, 2008
    #5
  6. wrote:
    > Hi,
    >
    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.
    >
    > Is there any efficient way to do it?
    >
    > May be Multiprocessing, Multitasking ?
    >


    If the 200-300 files are on the same disk, are not especially fragmented
    and your program is already IO-bound, parallel processing might
    conceivably slow things down by increasing the number of head-seeks needed.

    Just a thought.

    --
    RGB
     
    RedGrittyBrick, Dec 11, 2008
    #6
  7. "" <> wrote in news:5f1e2237-
    :

    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.
    >
    > Is there any efficient way to do it?
    >
    > May be Multiprocessing, Multitasking ?


    Here is one way to do it using Parallel::ForkManager.

    If your system is somewhat typical, you'll probably run into an IO
    bottleneck before you run into a CPU bottleneck.

    For example:

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> cat create.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    my $line = join("\t", qw( 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 )) . "\n";

    my $fn_tmpl = 'data_%2.2d.txt';
    my $fn = sprintf $fn_tmpl, 0;

    open my $out, '>', $fn
        or die "Cannot open '$fn': $!";

    for (1 .. 100_000) {
        print $out $line
            or die "Cannot write to '$fn': $!";
    }

    close $out
        or die "Cannot close: '$fn': $!";

    for (1 .. 19) {
        system copy => $fn, sprintf($fn_tmpl, $_);
    }

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> timethis create
    ....
    TimeThis : Command Line : create
    TimeThis : Start Time : Thu Dec 11 18:14:12 2008
    TimeThis : End Time : Thu Dec 11 18:14:16 2008
    TimeThis : Elapsed Time : 00:00:03.468

    Now, you have 20 input files with 100_000 lines each:

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> dir
    ....
    2008/12/11 06:14 PM 4,100,000 data_00.txt
    2008/12/11 06:14 PM 4,100,000 data_01.txt
    2008/12/11 06:14 PM 4,100,000 data_02.txt
    2008/12/11 06:14 PM 4,100,000 data_03.txt
    2008/12/11 06:14 PM 4,100,000 data_04.txt
    2008/12/11 06:14 PM 4,100,000 data_05.txt
    2008/12/11 06:14 PM 4,100,000 data_06.txt
    2008/12/11 06:14 PM 4,100,000 data_07.txt
    2008/12/11 06:14 PM 4,100,000 data_08.txt
    2008/12/11 06:14 PM 4,100,000 data_09.txt
    2008/12/11 06:14 PM 4,100,000 data_10.txt
    2008/12/11 06:14 PM 4,100,000 data_11.txt
    2008/12/11 06:14 PM 4,100,000 data_12.txt
    2008/12/11 06:14 PM 4,100,000 data_13.txt
    2008/12/11 06:14 PM 4,100,000 data_14.txt
    2008/12/11 06:14 PM 4,100,000 data_15.txt
    2008/12/11 06:14 PM 4,100,000 data_16.txt
    2008/12/11 06:14 PM 4,100,000 data_17.txt
    2008/12/11 06:14 PM 4,100,000 data_18.txt
    2008/12/11 06:14 PM 4,100,000 data_19.txt

    Here is a simple program to process the data:

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> cat process.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    use Parallel::ForkManager;

    my ($instances) = @ARGV;

    my $fn_tmpl = 'data_%2.2d.txt';

    my $pm = Parallel::ForkManager->new($instances);

    for my $i (0 .. 19) {
        $pm->start and next;

        my $input = sprintf $fn_tmpl, $i;

        eval {
            open my $in, '<', $input
                or die "Cannot open '$input': $!";

            while ( my $line = <$in> ) {
                my @data = split /\t/, $line;

                # replace with your own processing code
                # don't try to keep all your data in memory
            }
            close $in
                or die "Cannot close '$input': $!";
        };

        warn $@ if $@;

        $pm->finish;
    }

    $pm->wait_all_children;

    __END__

    First, try without forking to establish a baseline:

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> timethis process 0

    TimeThis : Command Line : process 0
    TimeThis : Start Time : Thu Dec 11 18:31:50 2008
    TimeThis : End Time : Thu Dec 11 18:32:41 2008
    TimeThis : Elapsed Time : 00:00:51.156

    Let's try a few more:

    TimeThis : Command Line : process 2
    TimeThis : Start Time : Thu Dec 11 18:35:15 2008
    TimeThis : End Time : Thu Dec 11 18:35:58 2008
    TimeThis : Elapsed Time : 00:00:43.578

    TimeThis : Command Line : process 4
    TimeThis : Start Time : Thu Dec 11 18:36:17 2008
    TimeThis : End Time : Thu Dec 11 18:36:59 2008
    TimeThis : Elapsed Time : 00:00:41.921

    TimeThis : Command Line : process 8
    TimeThis : Start Time : Thu Dec 11 18:37:18 2008
    TimeThis : End Time : Thu Dec 11 18:38:00 2008
    TimeThis : Elapsed Time : 00:00:41.328

    TimeThis : Command Line : process 16
    TimeThis : Start Time : Thu Dec 11 18:38:18 2008
    TimeThis : End Time : Thu Dec 11 18:38:58 2008
    TimeThis : Elapsed Time : 00:00:40.734

    TimeThis : Command Line : process 20
    TimeThis : Start Time : Thu Dec 11 18:39:17 2008
    TimeThis : End Time : Thu Dec 11 18:39:58 2008
    TimeThis : Elapsed Time : 00:00:40.578

    Not very impressive. Between no forking and a max of 20 instances, the
    time required to process was reduced by 20%, with most of the gains
    coming from running just 2. That probably has more to do with the
    implementation of fork on Windows than anything else.

    In fact, I should probably have used threads on Windows. Anyway, I'll
    boot into Linux and see if the returns there are greater.

    Try this simple experiment on your system. See how many instances gives
    you the best bang-per-buck.

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
     
    A. Sinan Unur, Dec 11, 2008
    #7
  8. "A. Sinan Unur" <> wrote in
    news:Xns9B71C0CDC9E12asu1cornelledu@127.0.0.1:

    > "" <> wrote in news:5f1e2237-
    > :
    >
    >> I analyzing some netwokr log files. There are around 200-300 files
    >> and each file has more than 2 million entries in it.
    >>
    >> Currently my script is reading each file line by line. So it will
    >> take lot of time to process all the files.
    >>
    >> Is there any efficient way to do it?
    >>
    >> May be Multiprocessing, Multitasking ?

    >
    > Here is one way to do it using Parallel::Forkmanager.
    >

    ....

    > Not very impressive. Between no forking vs max 20 instances, time
    > required to process was reduced by 20% with most of the gains coming
    > from running 2. That probably has more to do with the implementation
    > of fork on Windows than anything else.
    >
    > In fact, I should probably have used threads on Windows. Anyway, I'll
    > boot into Linux and see if the returns there are greater.


    Hmmm ... I tried it on ArchLinux using perl from the repository on the
    exact same hardware as the Windows tests:

    [sinan@archardy large]$ time perl process.pl 0

    real 0m29.983s
    user 0m29.848s
    sys 0m0.073s

    [sinan@archardy large]$ time perl process.pl 2

    real 0m15.281s
    user 0m29.865s
    sys 0m0.077s

    with no changes going to 4, 8, 16 or 20 max instances. Exact same
    program and data on the same hardware, yet the no fork version was 40%
    faster. Running it in a shell window in xfce4 versus at boot-up on the
    console and running it in an ntfs filesystem versus ext3 file system did
    not make any meaningful difference.

    The wireless connection was up but inactive in all scenarios.

    -- Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
     
    A. Sinan Unur, Dec 12, 2008
    #8
  9. cartercc Guest

    On Dec 11, 3:27 pm, "" <>
    wrote:
    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.


    Your question is really about data. The fact that your data is
    contained in files which have rows and columns is totally irrelevant.
    You would have the same problem if all the data were contained in just
    one file. If you have 200,000,000 items of data, you have that much
    data, and there's absolutely nothing you can do about it.

    > Is there any efficient way to do it?


    This is a good question, and the answer is, 'Maybe.' If you want to
    generate reports from the data, you might want to look into putting it
    into a database and writing queries against the database. That's what
    companies like Wal-mart, Amazon.com, and eBay do. Write a script that
    runs as a cron job at 2:00 am and reads all the data into a database.
    Then write another script that queries the database at 4:00 am and
    spits out the reports you want.
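
    A bare-bones sketch of the loading step (assuming DBD::SQLite via DBI;
    the table layout, the 'logs/*.log' glob and the tab-separated format
    are made up, so adapt the split to whatever your lines really contain):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=logs.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE IF NOT EXISTS entries (ts TEXT, host TEXT, msg TEXT)');
    my $sth = $dbh->prepare('INSERT INTO entries (ts, host, msg) VALUES (?, ?, ?)');

    for my $file (glob 'logs/*.log') {
        open my $in, '<', $file or die "Cannot open '$file': $!";
        while (my $line = <$in>) {
            chomp $line;
            my ($ts, $host, $msg) = split /\t/, $line, 3;
            $sth->execute($ts, $host, $msg);
        }
        close $in;
        $dbh->commit;    # one transaction per file keeps the inserts fast
    }

    $dbh->disconnect;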

    >
    > May be Multiprocessing, Multitasking ?


    If you are using an Intel-like processor, it multiprocesses anyway.
    There are only two ways to increase speed: increase the clock speed of
    the processor or increase the number of processors. With respect to the
    latter, take a look at Erlang. I'd bet a lot of money that you could
    write an Erlang script that would increase the speed by several orders
    of magnitude. (On my machine, Erlang generates about 60,000 threads in
    several milliseconds, and I have an old, slow machine.)

    CC
     
    cartercc, Dec 12, 2008
    #9
  10. On 2008-12-12 13:09, A. Sinan Unur <> wrote:
    > "A. Sinan Unur" <> wrote in
    > news:Xns9B71C0CDC9E12asu1cornelledu@127.0.0.1:
    >> "" <> wrote in news:5f1e2237-
    >> :
    >>
    >>> I analyzing some netwokr log files. There are around 200-300 files
    >>> and each file has more than 2 million entries in it.

    [...]
    >>> Is there any efficient way to do it?
    >>>
    >>> May be Multiprocessing, Multitasking ?

    >>
    >> Here is one way to do it using Parallel::Forkmanager.
    >>

    > ...
    >
    >> Not very impressive. Between no forking vs max 20 instances, time
    >> required to process was reduced by 20% with most of the gains coming
    >> from running 2. That probably has more to do with the implementation
    >> of fork on Windows than anything else.
    >>
    >> In fact, I should probably have used threads on Windows. Anyway, I'll
    >> boot into Linux and see if the returns there are greater.

    >
    > Hmmm ... I tried it on ArchLinux using perl from the repository on the
    > exact same hardware as the Windows tests:
    >
    > [sinan@archardy large]$ time perl process.pl 0
    >
    > real 0m29.983s
    > user 0m29.848s
    > sys 0m0.073s
    >
    > [sinan@archardy large]$ time perl process.pl 2
    >
    > real 0m15.281s
    > user 0m29.865s
    > sys 0m0.077s
    >
    > with no changes going to 4, 8, 16 or 20 max instances. Exact same
    > program and data on the same hardware, yet the no fork version was 40%
    > faster.


    Where do you get this 40% figure from? As far as I can see the forking
    version is almost exactly 100% faster (0m15.281s instead of 0m29.983s)
    than the non-forking version.

    This is to be expected. Your small test files fit completely into memory
    even on rather small systems, and if you ran process.pl directly after
    create.pl, they almost certainly were still cached. So the task is
    completely CPU-bound, and if you have at least two cores (as most
    current computers do), two processes should be twice as fast as one.

    Here is what I get for

    for i in `seq 0 25`
    do
        echo -n "$i "
        time ./process $i
    done

    on a dual-core system:

    0 ./process $i 20.85s user 0.10s system 99% cpu 21.024 total
    1 ./process $i 22.03s user 0.06s system 99% cpu 22.146 total
    2 ./process $i 21.86s user 0.04s system 197% cpu 11.093 total
    3 ./process $i 22.63s user 0.09s system 197% cpu 11.505 total
    [...]
    23 ./process $i 21.67s user 0.15s system 199% cpu 10.956 total
    24 ./process $i 22.91s user 0.10s system 199% cpu 11.553 total
    25 ./process $i 22.05s user 0.08s system 199% cpu 11.124 total


    Two processes are twice as fast as one, but adding more processes
    doesn't help (but doesn't hurt either).

    And here's the output for an 8-core system:

    0 ./process $i 10.22s user 0.05s system 99% cpu 10.275 total
    1 ./process $i 10.13s user 0.07s system 100% cpu 10.196 total
    2 ./process $i 10.19s user 0.06s system 199% cpu 5.138 total
    3 ./process $i 10.19s user 0.06s system 284% cpu 3.606 total
    4 ./process $i 10.19s user 0.06s system 395% cpu 2.589 total
    5 ./process $i 10.18s user 0.06s system 472% cpu 2.167 total
    6 ./process $i 10.20s user 0.05s system 495% cpu 2.069 total
    7 ./process $i 10.20s user 0.07s system 650% cpu 1.580 total
    8 ./process $i 10.18s user 0.06s system 652% cpu 1.571 total
    9 ./process $i 10.19s user 0.05s system 659% cpu 1.553 total
    10 ./process $i 10.20s user 0.06s system 667% cpu 1.538 total
    11 ./process $i 10.19s user 0.06s system 666% cpu 1.538 total
    12 ./process $i 10.19s user 0.06s system 706% cpu 1.451 total
    13 ./process $i 10.19s user 0.05s system 662% cpu 1.545 total
    14 ./process $i 10.19s user 0.06s system 689% cpu 1.486 total
    15 ./process $i 10.19s user 0.05s system 708% cpu 1.446 total
    16 ./process $i 10.20s user 0.06s system 755% cpu 1.357 total
    17 ./process $i 10.22s user 0.06s system 756% cpu 1.360 total
    18 ./process $i 10.20s user 0.05s system 741% cpu 1.383 total
    19 ./process $i 10.21s user 0.06s system 729% cpu 1.407 total
    20 ./process $i 10.23s user 0.05s system 726% cpu 1.415 total
    21 ./process $i 10.20s user 0.06s system 749% cpu 1.368 total
    22 ./process $i 10.21s user 0.05s system 726% cpu 1.411 total
    23 ./process $i 10.23s user 0.06s system 739% cpu 1.392 total
    24 ./process $i 10.21s user 0.04s system 712% cpu 1.440 total
    25 ./process $i 10.20s user 0.05s system 739% cpu 1.386 total

    Speed rises almost linearly up to 7 processes (which manage to use 6.5
    cores). Then it still gets a bit faster up to 16 or 17 processes (using
    7.5 cores), and after that it levels off. Not quite what I expected, but
    close enough.

    For the OP's problem, this test is most likely not representative: He
    has a lot more files and each is larger. So they may not fit into the
    cache, and even if they do, they probably aren't in the cache when his
    script runs (depends on how long ago they were last read/written and how
    busy the system is).

    hp
     
    Peter J. Holzer, Dec 13, 2008
    #10
  11. On 2008-12-12 15:02, cartercc <> wrote:
    > On Dec 11, 3:27 pm, "" <>
    > wrote:
    >> I analyzing some netwokr log files. There are around 200-300 files and
    >> each file has more than 2 million entries in it.
    >> Currently my script is reading each file line by line. So it will take
    >> lot of time to process all the files.

    [...]
    >>
    >> May be Multiprocessing, Multitasking ?

    >
    > If you are using an Intel-like processor, it multi processes, anyway.


    No. At least not on a level you notice. From a Perl programmer's view
    (or a Java or C programmer's), each core is a separate CPU. A
    single-threaded program will not become faster just because you have two
    or more cores. You have to program those threads (or processes)
    explicitly to get any speedup. See my results for Sinan's test program
    on a dual-core and an eight-core machine.

    > There are only two ways to increase speed: increase the clocks of the
    > processor or increase the number of processors. With respect to the
    > latter, take a look at Erlang. I'd bet a lot of money that you could
    > write an Erlang script that would increase the speed by several orders
    > of magnitude.


    I doubt that very much. Erlang is inherently multithreaded so you don't
    have to do anything special to use those 2 or 8 cores you have, but it
    doesn't magically make a processor "several orders of magnitude" faster,
    unless you have hundreds or thousands of processors.

    I think Erlang is usually compiled to native code (like C), so it may
    well be a few orders of magnitude faster than perl because of that. But
    that depends very much on the problem and extracting stuff from text
    files is something at which perl is relatively fast.

    > (On my machine, Erlang generates about 60,000 threads in
    > sevaral milliseconds, and I have an old, slow machine.)


    Which says nothing about how long it takes to parse a line in a log
    file.

    hp
     
    Peter J. Holzer, Dec 13, 2008
    #11
  12. Guest

    "Peter J. Holzer" <> wrote:
    > On 2008-12-12 13:09, A. Sinan Unur <> wrote:
    > >
    > >> Not very impressive. Between no forking vs max 20 instances, time
    > >> required to process was reduced by 20% with most of the gains coming
    > >> from running 2. That probably has more to do with the implementation
    > >> of fork on Windows than anything else.
    > >>
    > >> In fact, I should probably have used threads on Windows. Anyway, I'll
    > >> boot into Linux and see if the returns there are greater.

    > >
    > > Hmmm ... I tried it on ArchLinux using perl from the repository on the
    > > exact same hardware as the Windows tests:
    > >
    > > [sinan@archardy large]$ time perl process.pl 0
    > >
    > > real 0m29.983s
    > > user 0m29.848s
    > > sys 0m0.073s
    > >
    > > [sinan@archardy large]$ time perl process.pl 2
    > >
    > > real 0m15.281s
    > > user 0m29.865s
    > > sys 0m0.077s
    > >
    > > with no changes going to 4, 8, 16 or 20 max instances. Exact same
    > > program and data on the same hardware, yet the no fork version was 40%
    > > faster.

    >
    > Where do you get this 40% figure from? As far as I can see the forking
    > version is almost exactly 100% faster (0m15.281s instead of 0m29.983s)
    > than the non-forking version.



    I assumed he was comparing Linux to Windows, not within linux.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Dec 14, 2008
    #12
  13. Guest

    On Thu, 11 Dec 2008 12:27:15 -0800 (PST), "" <> wrote:

    >Hi,
    >
    >I analyzing some netwokr log files. There are around 200-300 files and
    >each file has more than 2 million entries in it.
    >
    >Currently my script is reading each file line by line. So it will take
    >lot of time to process all the files.
    >
    >Is there any efficient way to do it?
    >
    >May be Multiprocessing, Multitasking ?
    >
    >
    >Thanks.


    I'm estimating 100 characters per line, 2 million lines per file,
    at 300 files, which will be about 60 gigabytes of data to be read.

    If the files are to be read across a real 1 gigabit network, just
    reading the data will take about 10 minutes (I think, or about 600 seconds).
    Gigabit ethernet can theoretically transmit 100 MB/second, if its cache
    is big enough, but that includes packetizing data and protocol ACKs/NAKs.
    So, in reality, it's about 50 MB/second.

    Some drives can't read/write that fast. It's the upper limit of
    some small RAID systems. So the drives may not actually be able to keep up
    with network read requests if that's all you're doing.
    The CPU will be mostly idle.

    In reality, though, you're not reading huge blocks of data like in a
    disk-to-disk transfer; you're processing line by line.

    In this case, the I/O requests sit incrementally idle between bursts of
    line-by-line file processing on the CPU, AND the CPU sits idle waiting
    for each incremental I/O request.

    So where is the time being lost? Well, it's being lost in both places, on
    the CPU and the I/O.

    The CPU can process equivalent minutiae at about 25 GB/second, RAM
    at about 2-4 GB/second, and the hard drive (being slower than gigabit
    ethernet) at about 25 MB/second.

    For the CPU:
    It would be better to keep the CPU working all the time rather than waiting
    for I/O completion on single requests. The way to do this is to have many
    requests submitted at one time. I would say 25 threads, on 25 different files
    at a time. You're running the same function, just on a different thread; you
    just have to know which file is next.
    And multiple threads beat multiple processes hands down, simply because
    process switching takes much more overhead than thread switching. Another
    reason is that you have 25 buffers waiting for the I/O data instead of just 1.
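
    Roughly along these lines (an untested sketch, not the OP's code; the
    'logs/*.log' glob and the worker count of 8 are made-up placeholders):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $n_workers = 8;

    # shared queue of file names; each worker pulls the next unprocessed file
    my $queue = Thread::Queue->new;
    $queue->enqueue(glob 'logs/*.log');
    $queue->enqueue(undef) for 1 .. $n_workers;   # one end-marker per worker

    my @workers = map {
        threads->create(sub {
            while (defined(my $file = $queue->dequeue)) {
                open my $in, '<', $file or die "Cannot open '$file': $!";
                while (my $line = <$in>) {
                    # per-line processing goes here
                }
                close $in;
            }
        });
    } 1 .. $n_workers;

    $_->join for @workers;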

    For the I/O:
    When the I/O always has a request pending (cached), it doesn't have to wait
    for the CPU. Most of the memory transfer will be via the DMA controller, not
    the CPU.

    If you don't do multiple threads at all, there are still ways to speed it up.
    Even if you just buffer a batch of lines at a time before you process them,
    it would be better than no threads at all.
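
    For example, something like this (an untested sketch): read the file in
    roughly 1 MB chunks and split each chunk into lines yourself, instead of
    asking for one line at a time.

    use strict;
    use warnings;

    my $file = 'example.log';                 # hypothetical file name

    open my $in, '<', $file or die "Cannot open '$file': $!";

    my $buf;
    my $leftover = '';
    while (read($in, $buf, 1_048_576)) {      # ~1 MB per read
        $buf = $leftover . $buf;
        my @lines = split /\n/, $buf, -1;
        $leftover = pop @lines;               # possibly a partial last line
        for my $line (@lines) {
            # per-line processing goes here (newlines are already stripped)
        }
    }
    # $leftover holds the final line if the file does not end with a newline

    close $in;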

    Good luck.

    sln
     
    , Dec 14, 2008
    #13
  14. Guest

    On Sun, 14 Dec 2008 01:03:42 GMT, wrote:

    >On Thu, 11 Dec 2008 12:27:15 -0800 (PST), "" <> wrote:
    >
    >

    [...]
    >If you don't do multiple threads at all, there is still ways to speed it up.
    >Even if you create a cache of 10 lines at a time before you process, it would be
    >better than no threads at all.
    >

    Of course, it's a block read, not line by line.

    sln
     
    , Dec 14, 2008
    #14
  15. wrote in news:20081213192853.428$:

    > "Peter J. Holzer" <> wrote:
    >> On 2008-12-12 13:09, A. Sinan Unur <> wrote:
    >> >

    ....
    >> >> In fact, I should probably have used threads on Windows. Anyway,
    >> >> I'll boot into Linux and see if the returns there are greater.
    >> >
    >> > Hmmm ... I tried it on ArchLinux using perl from the repository on
    >> > the exact same hardware as the Windows tests:
    >> >
    >> > [sinan@archardy large]$ time perl process.pl 0
    >> >
    >> > real 0m29.983s
    >> > user 0m29.848s
    >> > sys 0m0.073s
    >> >
    >> > [sinan@archardy large]$ time perl process.pl 2
    >> >
    >> > real 0m15.281s
    >> > user 0m29.865s
    >> > sys 0m0.077s
    >> >
    >> > with no changes going to 4, 8, 16 or 20 max instances. Exact same
    >> > program and data on the same hardware, yet the no fork version was
    >> > 40% faster.

    >>
    >> Where do you get this 40% figure from? As far as I can see the
    >> forking version is almost exactly 100% faster (0m15.281s instead of
    >> 0m29.983s) than the non-forking version.

    >
    >
    > I assumed he was comparing Linux to Windows, not within linux.


    A very astute observation ;-)

    My purpose was to show the OP how to test whether forking etc. could
    provide performance gains. I did not think it would (I did say "you are
    going to run into IO bottlenecks before you run into CPU bottlenecks").

    I was still astonished by the fact that the exact same Perl program,
    with the exact same data, on the exact same hardware, being run under
    the latest available perl binary for each platform, was faster in
    ArchLinux than in Windows XP.

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
     
    A. Sinan Unur, Dec 14, 2008
    #15
  16. Tim Greer Guest

    wrote:

    > If the files are are to be read across a real 1 Gigabit network, just
    > reading the data will take about 10 minutes (I think, or about 600
    > seconds). A Gigabit ethernet, theoretically can transmit 100
    > MB/second, if its cache is big enough. But that includes packetizing
    > data and protocol ack/nak's. So, in reality, its about 50 MB/second.


    By all accounts, this aspect is probably irrelevant, as there was no
    mention of needing to transfer the data across a network. If that is
    the case, I'm hoping the OP mentions it, along with any other aspect
    that could play a potential role. Still, I'm pretty certain they mean
    they'll process the data on the system it resides on, or else transfer
    it to another system and process it there. Otherwise, there are
    certainly other aspects to consider, to be sure.
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
     
    Tim Greer, Dec 14, 2008
    #16
  17. Guest

    On Sat, 13 Dec 2008 20:29:46 -0800, Tim Greer <> wrote:

    > wrote:
    >
    >> If the files are are to be read across a real 1 Gigabit network, just
    >> reading the data will take about 10 minutes (I think, or about 600
    >> seconds). A Gigabit ethernet, theoretically can transmit 100
    >> MB/second, if its cache is big enough. But that includes packetizing
    >> data and protocol ack/nak's. So, in reality, its about 50 MB/second.

    >
    >By all accounts, this aspect is probably irrelevant, as there was no
    >mention of needing to transfer the data across a network. If this is
    >the case, I'm hoping the OP mentions it and any other aspect that could
    >play a potential role. Still, I'm pretty certain they mean they'll
    >process the data on the system the data resides on, or else they'll
    >transfer it to another system and then process the data on the system
    >the data is then (now) on. Otherwise, there are certainly other
    >aspects to consider, to be sure.

    OP:
    "Hi,

    I analyzing some netwokr log files. There are around ...
    "

    sln
     
    , Dec 14, 2008
    #17
  18. Guest

    On Sun, 14 Dec 2008 06:34:27 GMT, wrote:

    >On Sat, 13 Dec 2008 20:29:46 -0800, Tim Greer <> wrote:
    >
    >> wrote:
    >>
    >>> If the files are are to be read across a real 1 Gigabit network, just
    >>> reading the data will take about 10 minutes (I think, or about 600
    >>> seconds). A Gigabit ethernet, theoretically can transmit 100
    >>> MB/second, if its cache is big enough. But that includes packetizing
    >>> data and protocol ack/nak's. So, in reality, its about 50 MB/second.

    >>
    >>By all accounts, this aspect is probably irrelevant, as there was no
    >>mention of needing to transfer the data across a network. If this is
    >>the case, I'm hoping the OP mentions it and any other aspect that could
    >>play a potential role. Still, I'm pretty certain they mean they'll
    >>process the data on the system the data resides on, or else they'll
    >>transfer it to another system and then process the data on the system
    >>the data is then (now) on. Otherwise, there are certainly other
    >>aspects to consider, to be sure.

    > OP:
    >"Hi,
    >
    >I analyzing some netwokr log files. There are around ...
    >"
    >
    >sln


    In Chinese, this translates to "I got your US job files,
    no need to keep your workers, fire those bastards and join
    the Communist revolution"

    sln
     
    , Dec 14, 2008
    #18
  19. Tim Greer Guest

    wrote:

    > "Hi,
    >
    > I analyzing some netwokr log files. There are around ...
    > "
    >


    I didn't get the impression that meant the large preexisting logs needed
    to be transferred or read over the network as they were processed, but
    people have done stranger things, I suppose. :)
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
     
    Tim Greer, Dec 14, 2008
    #19
  20. cartercc Guest

    On Dec 13, 5:26 pm, "Peter J. Holzer" <> wrote:
    > No. At least not on a level you notice. From a perl programmer's view
    > (or a Java or C programmer's), each core is a separate CPU. A
    > single-threaded program will not become faster just because you have two
    > or more cores. You have to program those threads (or processes)
    > explicitely to get any speedup. See my results for Sinan's test program
    > for a dual- and eight core machine.


    I've never written a multithreaded Perl program, but I have written
    multithreaded programs in C and Java, and the big problem IMO is that
    all those threads typically want a piece of your shared object, so you
    have to build a gatekeeper to let those threads into your
    shared object one at a time. I expect that Perl is the same.
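
    For reference, in Perl that gatekeeper is typically lock() on a variable
    marked :shared. A tiny, made-up illustration (not from the thread's code):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $total :shared = 0;        # one object shared by all threads

    my @workers = map {
        threads->create(sub {
            for (1 .. 100_000) {
                lock($total);     # only one thread may update at a time;
                $total++;         # the lock is released at the end of the block
            }
        });
    } 1 .. 4;

    $_->join for @workers;
    print "total = $total\n";     # deterministically 400000 thanks to lock()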

    > I doubt that very much. Erlang is inherently multithreaded so you don't
    > have to do anything special to use those 2 or 8 cores you have, but it
    > doesn't magically make a processor "several orders of magnitude" faster,
    > unless you have hundreds or thousands of processors.


    That's true, but that's not really the point. The point is that Erlang
    is an asynchronous message-passing language, so you don't have to keep
    locking and unlocking shared objects. From my POV as a software guy
    (not a hardware guy), not having to explicitly program mutexes and
    semaphores is great. But yeah, theoretically Erlang program runtime
    speed is proportional to the number of processors you have.

    > I think Erlang is usually compiled to native code (like C), so it may
    > well be a few orders of magnitude faster than perl because of that. But
    > that depends very much on the problem and extracting stuff from text
    > files is something at which perl is relatively fast.


    Yes. You can't really make a priori statements about execution speed,
    and Perl is optimized for data manipulation while Erlang isn't, but
    still, Erlang is optimized for multithreading which might be the
    answer that the OP was looking for.

    > Which says nothing about how long it takes to parse a line in a log
    > file.


    Which was my point to begin with. If you have 2M records, you have 2M
    records and you've got to deal with it.

    BTW, I've never written an Erlang program to do what I use Perl for,
    I'm not sure if it can be done, and I don't know if it would have any
    benefit. However, I've seen what Erlang can do with multithreaded
    apps, and I certainly think that Erlang is a strong competitor for
    those kinds of applications.

    CC
     
    cartercc, Dec 18, 2008
    #20
