Shrink large file according to REG_EXP

thellper

Hello,
I have a problem to solve, and I need some help, please.
As input I have a large text file (up to 5GB) which I need to filter
according to some regular expressions, and then I need to write the
filtered (hopefully smaller) output to another file.
The filtering applies row by row: a row is split according to some
rules into various pieces, then some of the pieces are checked against
some regular expressions, and if a match is found, the whole line is
written to the output.
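
For illustration, my loop currently looks roughly like this (heavily
simplified; the semicolon delimiter and the two rules are just examples,
the real ones come from the syntax and rule files):

  use strict;
  use warnings;

  # simplified stand-in for my filter loop; the real delimiter and
  # [field index, pattern] rules are loaded from configuration files
  my @rules = (
      [ 0, qr/^ERROR/   ],
      [ 3, qr/timeout/i ],
  );

  open my $in,  '<', 'input.log'    or die "input: $!";
  open my $out, '>', 'filtered.log' or die "output: $!";

  while ( my $line = <$in> ) {
      my @fields = split /;/, $line;           # split rule from a syntax file
      for my $rule (@rules) {
          my ( $idx, $re ) = @$rule;
          if ( defined $fields[$idx] && $fields[$idx] =~ $re ) {
              print {$out} $line;              # whole line kept on first match
              last;
          }
      }
  }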

The problem is that this solution is slow.
I'm currently reading the whole file line by line and then applying the
regexes, but it is very slow.
I've noticed that the time to read and write the file without doing
anything else is very small, so I'm losing a lot of time in my
regexes.

OK, the whole program is more complicated: the files may have
different syntaxes, and I have syntax files which tell me how to split
each line into its fields. Then I separately load files with the rules
(the regexes) used to filter them.
Anyway, my idea was to try to use the forks.pm module (see CPAN) to
split the file into chunks and let each thread work on a chunk of the
file: can somebody tell me how to do this? Or is there a better way?

Any help is really appreciated.

Best regards,
Davide
 

xhoster

thellper said:
Hello,
I have a problem to solve, and I need some help, please.
As input I have a large text file (up to 5GB) which I need to filter
according to some regular expressions, and then I need to write the
filtered (hopefully smaller) output to another file.
The filtering applies row by row: a row is split according to some
rules into various pieces, then some of the pieces are checked against
some regular expressions, and if a match is found, the whole line is
written to the output.

The problem is that this solution is slow.
I'm currently reading the whole file line by line and then applying the
regexes, but it is very slow.
I've noticed that the time to read and write the file without doing
anything else is very small, so I'm losing a lot of time in my
regexes.

Figure out which regex is slow, why it is slow, and then make it faster.

If you did the first step and posted the culprit with some sample input, we
might be able to help with the latter two.
OK, the whole program is more complicated: the files may have
different syntaxes, and I have syntax files which tell me how to split
each line into its fields. Then I separately load files with the rules
(the regexes) used to filter them.
Anyway, my idea was to try to use the forks.pm module (see CPAN) to
split the file into chunks and let each thread work on a chunk of the
file: can somebody tell me how to do this? Or is there a better way?

I'd try to make the single-threaded version faster first, and resort to
parallelization only as a last resort. Also, if I were parallelizing
this, I probably wouldn't use forks.pm to do it. Once started, your
threads (or processes) really don't need to communicate with each other
(as long as you make independent output files to be combined later), so
a simpler solution would do, like Parallel::ForkManager or just calling
fork yourself. Or just start the jobs as separate processes in the first
place.

If the order of the lines in the output files isn't important, I'd give
each job a different integer token (from 0 to num_job-1) and then have each
job process only those lines where
$token == $. % $num_job
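
A minimal sketch of that scheme (untested; line_matches() here is just
a placeholder for your real splitting and matching, and the file names
are made up):

  use strict;
  use warnings;
  use Parallel::ForkManager;

  sub line_matches { $_[0] =~ /ERROR/ }     # placeholder for the real filter

  my $num_job = 4;
  my $pm = Parallel::ForkManager->new($num_job);

  for my $token ( 0 .. $num_job - 1 ) {
      $pm->start and next;                  # parent: launch the next child

      open my $in,  '<', 'input.log'        or die $!;
      open my $out, '>', "filtered.$token"  or die $!;
      while ( my $line = <$in> ) {
          next unless $token == $. % $num_job;   # $. is the input line number
          print {$out} $line if line_matches($line);
      }

      $pm->finish;                          # child exits
  }
  $pm->wait_all_children;
  # afterwards, concatenate filtered.0 .. filtered.3 if one file is needed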

Xho

 

Ted Zlatanov

t> The problem is that this solution is slow. I'm currently reading the
t> whole file line by line and then applying the regexes, but it is
t> very slow. I've noticed that the time to read and write the file
t> without doing anything else is very small, so I'm losing a lot of
t> time in my regexes.

t> OK, the whole program is more complicated: the files may have
t> different syntaxes, and I have syntax files which tell me how to
t> split each line into its fields. Then I separately load files with
t> the rules (the regexes) used to filter them. Anyway, my idea was to
t> try to use the forks.pm module (see CPAN) to split the file into
t> chunks and let each thread work on a chunk of the file: can somebody
t> tell me how to do this? Or is there a better way?

Please post a practical example of what's slow (with sample input) so we
can see, comment on, and test it. There's a Benchmark module that will
measure the performance of a function well.
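
For example, something like this (both subs are made-up placeholders
for your real matching code):

  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  my $line = "alpha;beta;some input text;gamma\n";

  # compare two candidate approaches, running each for ~3 CPU seconds
  cmpthese( -3, {
      split_then_match => sub {
          my @f = split /;/, $line;
          my $hit = $f[2] =~ /input/;
      },
      match_directly => sub {
          my $hit = $line =~ /^[^;]*;[^;]*;[^;]*input/;
      },
  } );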

Ted
 

nolo contendere

Hello,
I have a problem to solve, and I need some help, please.
As input I have a large text file (up to 5GB) which I need to filter
according to some regular expressions, and then I need to write the
filtered (hopefully smaller) output to another file.
The filtering applies row by row: a row is split according to some
rules into various pieces, then some of the pieces are checked against
some regular expressions, and if a match is found, the whole line is
written to the output.

The problem is that this solution is slow.
I'm currently reading the whole file line by line and then applying the
regexes, but it is very slow.
I've noticed that the time to read and write the file without doing
anything else is very small, so I'm losing a lot of time in my
regexes.

OK, the whole program is more complicated: the files may have
different syntaxes, and I have syntax files which tell me how to split
each line into its fields. Then I separately load files with the rules
(the regexes) used to filter them.
Anyway, my idea was to try to use the forks.pm module (see CPAN) to
split the file into chunks and let each thread work on a chunk of the
file: can somebody tell me how to do this? Or is there a better way?

check out /REGEX/o

and qr/REGEX/

...also, if you keep a history of which filters get used the most,
stick those at the top. this will speed up the file processing if the
trend does not change. you may want to re-sort periodically in case it
does change.
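
e.g. an untested sketch (@patterns, $in, and $out stand in for your
rule files and file handles):

  my @patterns = ( '^ERROR', 'timeout' );   # stand-in for the rule files

  # compile each filter once, up front
  my @filters = map { qr/$_/ } @patterns;

  my %hits;
  while ( my $line = <$in> ) {
      for my $re (@filters) {
          if ( $line =~ $re ) {
              $hits{$re}++;            # a qr object stringifies to a usable key
              print {$out} $line;
              last;
          }
      }
  }

  # periodically, or on the next run: most-used filters first
  @filters = sort { ($hits{$b} || 0) <=> ($hits{$a} || 0) } @filters;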
 

Uri Guttman

nc> check out /REGEX/o

obsolete and probably useless.

nc> and qr/REGEX/

we still haven't seen his code so that is not a solution. more likely
his loops are clunky and slow and his regexes are worse.

nc> ...also, if you keep a history of which filters get used the most,
nc> stick those at the top. this will speed up the file processing if the
nc> trend does not change. may want to do this periodically in case it
nc> does change.

or which are the slowest regexes and speed those up. there are too many
ways to optimize unknown code. let's see if the OP will actually post
some data and code.

uri
 

nolo contendere

  nc> check out /REGEX/o

  > obsolete and probably useless.

really? is this since 5.10?
  nc> and qr/REGEX/

  > we still haven't seen his code so that is not a solution. more likely
  > his loops are clunky and slow and his regexes are worse.

  nc> ...also, if you keep a history of which filters get used the most,
  nc> stick those at the top. this will speed up the file processing if the
  nc> trend does not change. may want to do this periodically in case it
  nc> does change.

  > or which are the slowest regexes and speed those up. there are too
  > many ways to optimize unknown code. let's see if the OP will actually
  > post some data and code.

yeah, Xho already suggested the speed-up-the-slowest-regex solution,
so I was going for something different. you're right though, code +
data would help enormously.
 

Uri Guttman

nc> really? is this since 5.10?

since at least when qr// came in. also, dynamic regexes (those with
interpolation) are not recompiled unless some variable in them
changes. this is what /o was all about in the early days of perl5, so
its purpose of avoiding recompilation has been moot for eons. and qr//
makes it even more useless.
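
the one thing /o still buys you is freezing the pattern at its initial
value:

  my $pat = 'foo';
  while (<>) {
      print if /$pat/o;   # compiled once, from the initial value of $pat
      $pat = 'bar';       # later changes to $pat are silently ignored by /o
  }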

uri
 

nolo contendere

  >>
  >>   nc> check out /REGEX/o
  >>
  >> obsolete and probably useless.
  >>

  nc> really? is this since 5.10?

  > since at least when qr// came in. also, dynamic regexes (those with
  > interpolation) are not recompiled unless some variable in them
  > changes. this is what /o was all about in the early days of perl5, so
  > its purpose of avoiding recompilation has been moot for eons. and qr//
  > makes it even more useless.

Ok, thanks for the info.
 

Martien Verbruggen

> If your program is I/O bound, then it might be faster to work on
> different parts simultaneously.

If the process is I/O bound, then it's unlikely that it'll speed up if
you work on multiple parts simultaneously, unless you can guarantee that
those multiple parts are going to come from different parts of your I/O
subsystem, i.e. parts that don't compete with each other for resources.
Given that it's one single file as input, it's very unlikely that you'll
be able to pick your parts to work on in such a way that you avoid I/O
contention.

You might see some improvement if you're lucky, but you could also see a
marked decrease in total I/O speed if you're unlucky.

Splitting a process into multiple worker processes is generally only
better if each worker process can then utilise a piece of hardware that
wasn't used before, like another I/O system or another CPU.

> However, you are going to suffer some head thrashing as your multiple
> processes attempt to read different parts of the same file at the same
> time.

Indeed, at least if your file is on a single disk. If it's on a RAID
system, the O/S might be able to avoid contention on the disks. Or not.
For linear access patterns you generally do get some improvement.

> If your program is cpu bound, then splitting up the work won't help
> unless you are using a multi-processor system.

Indeed.

But CPU bound processes can benefit from algorithm improvements, or even
small tweaks to code if that code is in a place that gets executed a
lot. Profiling would be able to identify that.

> If, as you say, reading the file without doing any processing is quick
> enough, then it is the processing of the data that is the bottleneck.

Agree :) It's also really the only bit which is likely to be
Perl-specific; everything before it is not.

Martien
 

Uri Guttman

b> I won't ask you lots of questions - but do you have a link
b> to this info that I can read - it's of (substantial) interest
b> to me.

this should be in perlop under the regexp quote-like operators, but it
doesn't mention that /o is useless now. the faq covers it. and 5.6 is
pretty old, so /o has been useless for years.


perlfaq6: What is /o really for? (code snipped)

The /o option for regular expressions (documented in perlop and
perlreref) tells Perl to compile the regular expression only once. This
is only useful when the pattern contains a variable. Perls 5.6 and later
handle this automatically if the pattern does not change.

Since the match operator m//, the substitution operator s///, and the
regular expression quoting operator qr// are double-quotish constructs,
you can interpolate variables into the pattern. See the answer to "How
can I quote a variable to use in a regex?" for more details.

Versions of Perl prior to 5.6 would recompile the regular expression for
each iteration, even if $pattern had not changed. The /o would prevent
this by telling Perl to compile the pattern the first time, then reuse
that for subsequent iterations.

In versions 5.6 and later, Perl won't recompile the regular expression
if the variable hasn't changed, so you probably don't need the /o
option. It doesn't hurt, but it doesn't help either. If you want any
version of Perl to compile the regular expression only once even if the
variable changes (thus, only using its initial value), you still need
the /o.

You can watch Perl's regular expression engine at work to verify for
yourself if Perl is recompiling a regular expression. The use re 'debug'
pragma (which comes with Perl 5.005 and later) shows the details. With
Perls before 5.6, you should see re reporting that it's compiling the
regular expression on each iteration. With Perl 5.6 or later, you should
only see re report that for the first iteration.
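
e.g. a quick way to watch it yourself:

  use re 'debug';
  my $pat = 'foo';
  for my $s (qw( food fool )) {
      print "match: $s\n" if $s =~ /$pat/;  # on 5.6+, the debug output shows
                                            # the pattern compiled just once
  }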

uri
 

comp.llang.perl.moderated

Hello,
I have a problem to solve, and I need some help, please.
As input I have a large text file (up to 5GB) which I need to filter
according to some regular expressions, and then I need to write the
filtered (hopefully smaller) output to another file.
The filtering applies row by row: a row is split according to some
rules into various pieces, then some of the pieces are checked against
some regular expressions, and if a match is found, the whole line is
written to the output.
...

Just a guess, but splitting into pieces and then applying the regex
to each piece may well be a significant slowdown. Have you considered
trying to tweak the regex to avoid the split and the resultant copies?
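
For instance, if the fields were semicolon-separated and only the 2nd
field mattered, the split could collapse into a single anchored match
(made-up example; $in and $out are the usual file handles):

  # instead of:  my @f = split /;/, $line;  $f[1] =~ /foo/;  ...
  while ( my $line = <$in> ) {
      print {$out} $line if $line =~ /^[^;]*;[^;]*foo/;   # 'foo' in 2nd field
  }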
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to Uri Guttman]

> In versions 5.6 and later, Perl won't recompile the regular expression
> if the variable hasn't changed, so you probably don't need the /o
> option. It doesn't hurt, but it doesn't help either.

Yet another case of broken documentation. Still, //o helps (though
nowhere near as dramatically as before): it avoids CHECKING that the
pattern did not change.

Hope this helps,
Ilya
 

John Bokma

Ilya Zakharevich said:
Yet another case of broken documentation.

Important question: how can this be fixed?

Preferably both:

- the documentation itself,
- and a way to make the fixing process easier (wiki?)
 
