dniq00
Hello, oh almighty perl gurus!
I'm trying to implement multithreaded processing for the humongous
amount of logs that I'm currently processing in 1 process on a 4-CPU
server.
For each line, the script checks whether the line contains a GET request, and if it does, goes through a list of pre-compiled regular expressions, trying to find a matching one. Once a match is found, it uses another, somewhat more complex regexp associated with that match to extract data from the line. I have split it into two separate matches because only about 30% of all lines will match, and I don't want to run that complex data-extracting regexp on all the lines I know won't match. The goal is to count how many lines matched each specific regexp; the end result is built as a hash, with the data extracted from the line by the second regexp used as the hash keys, and the number of matches as the values.
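To make the two-stage scheme concrete, here's a minimal sketch of what I mean (the patterns and log format here are made-up examples, not my real rules): each rule pairs a cheap "gate" regexp with a more expensive capturing regexp, and the capture becomes the hash key.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pre-compiled rules: a cheap first-stage "gate" regexp
# paired with a more expensive second-stage regexp that extracts the key.
my @rules = (
    { gate => qr{GET /images/}, extract => qr{GET /images/(\w+)\.\w+ } },
    { gate => qr{GET /api/},    extract => qr{GET /api/(\w+)}          },
);

my %counters;

sub process_line {
    my ($line) = @_;
    return unless index( $line, 'GET' ) >= 0;   # skip non-GET lines cheaply
    for my $rule (@rules) {
        next unless $line =~ $rule->{gate};     # cheap first-stage match
        if ( $line =~ $rule->{extract} ) {      # expensive extraction only on hits
            $counters{$1}++;                    # captured data becomes the hash key
        }
        last;                                   # first matching rule wins
    }
}
```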
Anyway, currently all this is done in a single process, which parses
approx. 30000 lines per second. The CPU usage for this process is
100%, so the bottleneck is in the parsing part.
I have changed the script to use threads + threads::shared +
Thread::Queue. I read data from logs like this:
Code
until( $no_more_data ) {
    my @buffer;
    foreach ( 1 .. $buffer_size ) {
        if ( my $line = <> ) {
            push @buffer, $line;
        } else {
            $no_more_data = 1;
            $q_in->enqueue( \@buffer );
            foreach ( 1 .. $cpu_count ) {
                $q_in->enqueue( undef );    # one terminator per worker thread
            }
            last;
        }
    }
    $q_in->enqueue( \@buffer ) unless $no_more_data;
}
Then, I create $cpu_count threads, each of which does something like this:
Code
sub parser {
    my $counters = {};
    while ( my $buffer = $q_in->dequeue() ) {
        foreach my $line ( @{$buffer} ) {
            # do its thing
        }
    }
    return $counters;
}
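Since each parser keeps its own private $counters, the main thread can merge the per-thread hashes after join() with no locking at all. Here's a stripped-down sketch of that wiring (the worker body just counts raw strings here, standing in for the real matching):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q_in = Thread::Queue->new;

# Each worker keeps a private hash and returns it; no shared counters.
sub parser {
    my %counters;
    while ( my $buffer = $q_in->dequeue ) {   # undef terminates the loop
        $counters{$_}++ for @{$buffer};       # stand-in for the real matching
    }
    return \%counters;
}

my @workers = map { threads->create( \&parser ) } 1 .. 2;

$q_in->enqueue( [ 'a', 'b', 'a' ] );
$q_in->enqueue(undef) for @workers;           # one terminator per worker

# Merge the per-thread results in the main thread after join().
my %total;
for my $thr (@workers) {
    my $part = $thr->join;
    $total{$_} += $part->{$_} for keys %{$part};
}
```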
Everything works fine, HOWEVER! It's all so damn slow! It's only 10% faster than the single-process script, consumes about 2-3 times more memory, and about as many times more CPU.
I've also tried abandoning Thread::Queue and just using threads::shared with a lock/cond_wait/cond_signal combination, without much success.
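For reference, this is roughly the kind of hand-off I tried (a bare-bones sketch, not my actual code): a shared buffer and flag guarded by a lock, with the consumer sleeping in cond_wait until the producer signals.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

my @slot : shared;            # shared buffer
my $ready : shared = 0;       # predicate guarding the hand-off

my $consumer = threads->create( sub {
    lock($ready);
    cond_wait($ready) until $ready;   # sleep until signalled; recheck the flag
    return scalar @slot;              # "consume": just report the size here
} );

{
    lock($ready);
    push @slot, 'line 1', 'line 2';   # produce under the lock
    $ready = 1;
    cond_signal($ready);              # wake the waiting consumer
}

my $n = $consumer->join;
```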
I've tried playing with $cpu_count and $buffer_size, and found that $buffer_size > 1000 doesn't make much difference, and $cpu_count > 2 actually makes things a lot worse.
Any ideas why in the world it's so slow? I did some research and couldn't find much info, other than that the way I'm doing it is pretty much the way it should be done, unless I'm missing something...
Hope anybody can enlighten me...
THANKS!