Shrink large file according to REG_EXP

thellper

Hello,
I have a problem to solve, and I need some help, please.
As input I have a large text file (up to 5GB) which I need to filter
according to some regular expressions, and then I need to write the
filtered (hopefully smaller) output to another file.
The filtering applies row by row: a row is split according to some
rules into various pieces, then some of the pieces are checked against
some regular expressions, and if a match is found, the whole line is
written to the output.
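
For illustration, my loop currently looks roughly like this (heavily
simplified; the semicolon delimiter and the two rules are just examples,
the real ones come from the syntax and rule files):

  use strict;
  use warnings;

  # simplified stand-in for my filter loop; the real delimiter and
  # [field index, pattern] rules are loaded from configuration files
  my @rules = (
      [ 0, qr/^ERROR/   ],
      [ 3, qr/timeout/i ],
  );

  open my $in,  '<', 'input.log'    or die "input: $!";
  open my $out, '>', 'filtered.log' or die "output: $!";

  while ( my $line = <$in> ) {
      my @fields = split /;/, $line;           # split rule from a syntax file
      for my $rule (@rules) {
          my ( $idx, $re ) = @$rule;
          if ( defined $fields[$idx] && $fields[$idx] =~ $re ) {
              print {$out} $line;              # whole line kept on first match
              last;
          }
      }
  }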

The problem is that this solution is slow.
I'm currently reading the whole file line by line and then applying the
regexes, but it is very slow.
I've noticed that the time to read and write the file without doing
anything else is very small, so I'm losing a lot of time in my
regexes.

OK, the whole program is more complicated: the files may have
different syntaxes, and I have syntax files which tell me how to split
each line into its fields. Then I separately load files with the rules
(the regexes) used to filter them.
Anyway, my idea was to try to use the forks.pm module (see CPAN) to
split the file into chunks and let each thread work on a chunk of the
file: can somebody tell me how to do this? Or is there a better way?

Any help is really appreciated.

Best regards,
Davide
 

xhoster

thellper said:
Hello,
I have a problem to solve, and I need some help, please.
As input I have a large text file (up to 5GB) which I need to filter
according to some regular expressions, and then I need to write the
filtered (hopefully smaller) output to another file.
The filtering applies row by row: a row is split according to some
rules into various pieces, then some of the pieces are checked against
some regular expressions, and if a match is found, the whole line is
written to the output.

The problem is that this solution is slow.
I'm currently reading the whole file line by line and then applying the
regexes, but it is very slow.
I've noticed that the time to read and write the file without doing
anything else is very small, so I'm losing a lot of time in my
regexes.

Figure out which regex is slow, why it is slow, and then make it faster.

If you did the first step and posted the culprit with some sample input, we
might be able to help with the latter two.
OK, the whole program is more complicated: the files may have
different syntaxes, and I have syntax files which tell me how to split
each line into its fields. Then I separately load files with the rules
(the regexes) used to filter them.
Anyway, my idea was to try to use the forks.pm module (see CPAN) to
split the file into chunks and let each thread work on a chunk of the
file: can somebody tell me how to do this? Or is there a better way?

I'd try to make the single-threaded version faster first, and resort to
parallelization only as a last resort. Also, if I were parallelizing
this, I probably wouldn't use forks.pm to do it. Once started, your
threads (or processes) really don't need to communicate with each other
(as long as you make independent output files to be combined later), so
a simpler solution would do, like Parallel::ForkManager or just calling
fork yourself. Or just start the jobs as separate processes in the first
place.

If the order of the lines in the output files isn't important, I'd give
each job a different integer token (from 0 to num_job-1) and then have each
job process only those lines where
$token == $. % $num_job
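
A minimal sketch of that scheme (untested; line_matches() here is just
a placeholder for your real splitting and matching, and the file names
are made up):

  use strict;
  use warnings;
  use Parallel::ForkManager;

  sub line_matches { $_[0] =~ /ERROR/ }     # placeholder for the real filter

  my $num_job = 4;
  my $pm = Parallel::ForkManager->new($num_job);

  for my $token ( 0 .. $num_job - 1 ) {
      $pm->start and next;                  # parent: launch the next child

      open my $in,  '<', 'input.log'        or die $!;
      open my $out, '>', "filtered.$token"  or die $!;
      while ( my $line = <$in> ) {
          next unless $token == $. % $num_job;   # $. is the input line number
          print {$out} $line if line_matches($line);
      }

      $pm->finish;                          # child exits
  }
  $pm->wait_all_children;
  # afterwards, concatenate filtered.0 .. filtered.3 if one file is needed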

Xho

 

Ted Zlatanov

t> The problem is that this solution is slow. I'm currently reading the
t> whole file line by line and then applying the regexes, but it is
t> very slow. I've noticed that the time to read and write the file
t> without doing anything else is very small, so I'm losing a lot of
t> time in my regexes.

t> OK, the whole program is more complicated: the files may have
t> different syntaxes, and I have syntax files which tell me how to
t> split each line into its fields. Then I separately load files with
t> the rules (the regexes) used to filter them. Anyway, my idea was to
t> try to use the forks.pm module (see CPAN) to split the file into
t> chunks and let each thread work on a chunk of the file: can somebody
t> tell me how to do this? Or is there a better way?

Please post a practical example of what's slow (with sample input) so we
can see, comment on, and test it. There's a Benchmark module that will
measure the performance of a function well.
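
For example, something like this (both subs are made-up placeholders
for your real matching code):

  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  my $line = "alpha;beta;some input text;gamma\n";

  # compare two candidate approaches, running each for ~3 CPU seconds
  cmpthese( -3, {
      split_then_match => sub {
          my @f = split /;/, $line;
          my $hit = $f[2] =~ /input/;
      },
      match_directly => sub {
          my $hit = $line =~ /^[^;]*;[^;]*;[^;]*input/;
      },
  } );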

Ted
 

nolo contendere

Hello,
I have a problem to solve, and I need some help, please.
As input I have a large text file (up to 5GB) which I need to filter
according to some regular expressions, and then I need to write the
filtered (hopefully smaller) output to another file.
The filtering applies row by row: a row is split according to some
rules into various pieces, then some of the pieces are checked against
some regular expressions, and if a match is found, the whole line is
written to the output.

The problem is that this solution is slow.
I'm currently reading the whole file line by line and then applying the
regexes, but it is very slow.
I've noticed that the time to read and write the file without doing
anything else is very small, so I'm losing a lot of time in my
regexes.

OK, the whole program is more complicated: the files may have
different syntaxes, and I have syntax files which tell me how to split
each line into its fields. Then I separately load files with the rules
(the regexes) used to filter them.
Anyway, my idea was to try to use the forks.pm module (see CPAN) to
split the file into chunks and let each thread work on a chunk of the
file: can somebody tell me how to do this? Or is there a better way?

check out /REGEX/o

and qr/REGEX/

...also, if you keep a history of which filters get used the most,
stick those at the top. this will speed up the file processing if the
trend does not change. you may want to re-sort periodically in case it
does change.
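
e.g. an untested sketch (@patterns, $in, and $out stand in for your
rule files and file handles):

  my @patterns = ( '^ERROR', 'timeout' );   # stand-in for the rule files

  # compile each filter once, up front
  my @filters = map { qr/$_/ } @patterns;

  my %hits;
  while ( my $line = <$in> ) {
      for my $re (@filters) {
          if ( $line =~ $re ) {
              $hits{$re}++;            # a qr object stringifies to a usable key
              print {$out} $line;
              last;
          }
      }
  }

  # periodically, or on the next run: most-used filters first
  @filters = sort { ($hits{$b} || 0) <=> ($hits{$a} || 0) } @filters;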
 

Uri Guttman

nc> check out /REGEX/o

obsolete and probably useless.

nc> and qr/REGEX/

we still haven't seen his code so that is not a solution. more likely
his loops are clunky and slow and his regexes are worse.

nc> ...also, if you keep a history of which filters get used the most,
nc> stick those at the top. this will speed up the file processing if the
nc> trend does not change. may want to do this periodically in case it
nc> does change.

or which are the slowest regexes and speed those up. there are too many
ways to optimize unknown code. let's see if the OP will actually post
some data and code.

uri
 

nolo contendere

  nc> check out /REGEX/o

  > obsolete and probably useless.

really? is this since 5.10?
  nc> and qr/REGEX/

  > we still haven't seen his code so that is not a solution. more likely
  > his loops are clunky and slow and his regexes are worse.

  nc> ...also, if you keep a history of which filters get used the most,
  nc> stick those at the top. this will speed up the file processing if the
  nc> trend does not change. may want to do this periodically in case it
  nc> does change.

  > or which are the slowest regexes and speed those up. there are too
  > many ways to optimize unknown code. let's see if the OP will actually
  > post some data and code.

yeah, Xho already suggested the speed-up-the-slowest-regex solution,
so I was going for something different. you're right though, code +
data would help enormously.
 

Uri Guttman

nc> really? is this since 5.10?

since at least when qr// came in. also, dynamic regexes (those with
interpolation) are not recompiled unless some variable in them
changes. this is what /o was all about in the early days of perl5, so
its purpose of avoiding recompilation has been moot for eons. and qr//
makes it even more useless.
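
the one thing /o still buys you is freezing the pattern at its initial
value:

  my $pat = 'foo';
  while (<>) {
      print if /$pat/o;   # compiled once, from the initial value of $pat
      $pat = 'bar';       # later changes to $pat are silently ignored by /o
  }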

uri
 

nolo contendere

  >>
  >>   nc> check out /REGEX/o
  >>
  >> obsolete and probably useless.
  >>

  nc> really? is this since 5.10?

  > since at least when qr// came in. also, dynamic regexes (those with
  > interpolation) are not recompiled unless some variable in them
  > changes. this is what /o was all about in the early days of perl5, so
  > its purpose of avoiding recompilation has been moot for eons. and qr//
  > makes it even more useless.

Ok, thanks for the info.
 

Martien Verbruggen

> If your program is I/O bound, then it might be faster to work on
> different parts simultaneously.

If the process is I/O bound, then it's unlikely that it'll speed up if
you work on multiple parts simultaneously, unless you can guarantee that
those multiple parts are going to come from different parts of your I/O
subsystem, i.e. parts that don't compete with each other for resources.
Given that it's one single file as input, it's very unlikely that you'll
be able to pick your parts to work on in such a way that you avoid I/O
contention.

You might see some improvement if you're lucky, but you could also see a
marked decrease in total I/O speed if you're unlucky.

Splitting a process into multiple worker processes is generally only
better if each worker process can then utilise a piece of hardware that
wasn't used before, like another I/O system or another CPU.

> However, you are going to suffer some head thrashing as your multiple
> processes attempt to read different parts of the same file at the same
> time.

Indeed, at least if your file is on a single disk. If it's on a RAID
system, the O/S might be able to avoid contention on the disks. Or not.
For linear access patterns you generally do get some improvement.

> If your program is cpu bound, then splitting up the work won't help
> unless you are using a multi-processor system.

Indeed.

But CPU bound processes can benefit from algorithm improvements, or even
small tweaks to code if that code is in a place that gets executed a
lot. Profiling would be able to identify that.

> If, as you say, reading the file without doing any processing is quick
> enough, then it is the processing of the data that is the bottleneck.

Agree :) It's also really the only bit which is likely to be
Perl-specific; everything before it is not.

Martien
 

Uri Guttman

b> I won't ask you lots of questions - but do you have a link
b> to this info that I can read - it's of (substantial) interest
b> to me.

this should be in perlop under the regexp quote-like operators, but it
doesn't mention that /o is useless now. the faq covers it. and 5.6 is
pretty old, so /o has been useless for years.


perlfaq6: What is /o really for? (code snipped)

The /o option for regular expressions (documented in perlop and
perlreref) tells Perl to compile the regular expression only once. This
is only useful when the pattern contains a variable. Perls 5.6 and later
handle this automatically if the pattern does not change.

Since the match operator m//, the substitution operator s///, and the
regular expression quoting operator qr// are double-quotish constructs,
you can interpolate variables into the pattern. See the answer to "How
can I quote a variable to use in a regex?" for more details.

Versions of Perl prior to 5.6 would recompile the regular expression for
each iteration, even if $pattern had not changed. The /o would prevent
this by telling Perl to compile the pattern the first time, then reuse
that for subsequent iterations.

In versions 5.6 and later, Perl won't recompile the regular expression
if the variable hasn't changed, so you probably don't need the /o
option. It doesn't hurt, but it doesn't help either. If you want any
version of Perl to compile the regular expression only once even if the
variable changes (thus, only using its initial value), you still need
the /o.

You can watch Perl's regular expression engine at work to verify for
yourself if Perl is recompiling a regular expression. The use re 'debug'
pragma (which comes with Perl 5.005 and later) shows the details. With
Perls before 5.6, you should see re reporting that it's compiling the
regular expression on each iteration. With Perl 5.6 or later, you should
only see re report that for the first iteration.
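
e.g. a quick way to watch it yourself:

  use re 'debug';
  my $pat = 'foo';
  for my $s (qw( food fool )) {
      print "match: $s\n" if $s =~ /$pat/;  # on 5.6+, the debug output shows
                                            # the pattern compiled just once
  }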

uri
 

comp.llang.perl.moderated

Hello,
I have a problem to solve, and I need some help, please.
As input I have a large text file (up to 5GB) which I need to filter
according to some regular expressions, and then I need to write the
filtered (hopefully smaller) output to another file.
The filtering applies row by row: a row is split according to some
rules into various pieces, then some of the pieces are checked against
some regular expressions, and if a match is found, the whole line is
written to the output.
...

Just a guess, but splitting into pieces and then applying the regex
to each piece may well be a significant slowdown. Have you considered
trying to tweak the regex to avoid the split and the resultant copies?
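
For instance, if the fields were semicolon-separated and only the 2nd
field mattered, the split could collapse into a single anchored match
(made-up example; $in and $out are the usual file handles):

  # instead of:  my @f = split /;/, $line;  $f[1] =~ /foo/;  ...
  while ( my $line = <$in> ) {
      print {$out} $line if $line =~ /^[^;]*;[^;]*foo/;   # 'foo' in 2nd field
  }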
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to Uri Guttman]

> In versions 5.6 and later, Perl won't recompile the regular expression
> if the variable hasn't changed, so you probably don't need the /o
> option. It doesn't hurt, but it doesn't help either.

Yet another case of broken documentation. Still, //o helps (though
nowhere near as dramatically as before): it avoids CHECKING that the
pattern did not change.

Hope this helps,
Ilya
 

John Bokma

Ilya Zakharevich said:
Yet another case of broken documentation.

Important question: how can this be fixed?

Preferably both:

- the documentation itself,
- and a way to make the fixing process easier (wiki?)
 
