Best Practice for Multiline Regexps

Lance Pollard

Hey,

What is the recommended way to parse a tree of files and replace
multiline pattern matches if you have, say, 20 regular expressions you're
looking for? I understand how to traverse directories, read/write
files, and use complex regular expressions, but the question is, what's
the optimal/recommended way to parse files (find/replace "def
method_name ...(some lines)... end", with some string, for instance)?
Is it:

1) Read file to String, match string against first pattern, read next
file into String, match with same pattern... once I've gone through all
the files with the first pattern, start over with the next pattern.
2) Read file to String, match whole string against all 20 patterns, go
to next file, match against 20 patterns...
3) Read file to String, match each line by 20 patterns...
4) Something with a Tokenizer which I don't yet understand (if so, could
you shed some light on it for me :) )

I basically want to write a few patterns to replace multiline text
patterns in lots of files, and need a consistent/fast way to do it in
Ruby, without learning C or anything.

Thanks so much for the help,
Lance
 
Lance Pollard

I have read through the TextMate docs and have looked extensively at
their Language Grammars source files (ruby, javascript, actionscript,
etc.), which suggest using "begin" and "end" patterns. That makes
sense, but I'm not sure how to run through the string when I have so many
begins and ends. The simplest solution would be to look for one
pattern in a file at a time, but that seems like it'd be really slow.

This'll be nice to know for code parsing, and for code generation.

Thanks again.
 
Robert Klemme

What is the recommended way to parse a tree of files and replace
multiline pattern matches if you have, say, 20 regular expressions you're
looking for? I understand how to traverse directories, read/write
files, and use complex regular expressions, but the question is, what's
the optimal/recommended way to parse files (find/replace "def
method_name ...(some lines)... end", with some string, for instance)?
Is it:

1) Read file to String, match string against first pattern, read next
file into String, match with same pattern... once I've gone through all
the files with the first pattern, start over with the next pattern.

That's the worst you can do.

2) Read file to String, match whole string against all 20 patterns, go
to next file, match against 20 patterns...

Most efficient of the simple approaches.

3) Read file to String, match each line by 20 patterns...

How do you want to do that with multiline patterns? Is there an easy
way to convert them to several line based patterns? If not, you can
forget this option.

4) Something with a Tokenizer which I don't yet understand (if so, could
you shed some light on it for me :) )


Which Tokenizer are you referring to? Can you give more detail about
your patterns or the kind of replacement you want to do? If not (e.g.
because they must be generic) the simplest and most efficient seems to
be option 2 unless we are talking about GB file sizes.
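
Option 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the rule table, the `.as` extension default, and the helper names `apply_rules` and `rewrite_tree` are made up for this sketch and do not come from the thread. Each file is read once, every pattern is applied to the in-memory string, and the file is written back only if something changed.

```ruby
require "find"

# Hypothetical rule table: each entry pairs a pattern with its replacement.
# These two rules are placeholders, not patterns from this thread.
RULES = [
  [/foo\s*\n\s*bar/, "foobar"], # a multiline match spanning a line break
  [/[ \t]+$/, ""],              # strip trailing blanks on every line
].freeze

# Apply every rule to one string, in order, and return the resulting string.
def apply_rules(text, rules = RULES)
  rules.inject(text) { |acc, (pattern, replacement)| acc.gsub(pattern, replacement) }
end

# Walk a directory tree, reading each matching file exactly once.
# Returns the list of paths that were rewritten.
def rewrite_tree(root, ext = ".as", rules = RULES)
  changed = []
  Find.find(root) do |path|
    next unless File.file?(path) && File.extname(path) == ext
    original = File.read(path)
    updated = apply_rules(original, rules)
    next if updated == original # leave untouched files alone
    File.write(path, updated)
    changed << path
  end
  changed
end
```

Calling `rewrite_tree("src")` would then process every `.as` file under `src`. Because the rules run in order on the same string, a later rule sees the output of an earlier one, so rule order can matter.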

Kind regards

robert
 
spiralofhope

That's the worst you can do.


Most efficient of the simple approaches.

Robert is right from a hard drive perspective.

To understand why method 2 works well - just remember that when a file
is read from your disk, it is cached. Your first solution would force
the system to cache file 1, process it, then cache file 2 to process
it.. through to caching file n and then back to file 1 again, cycling
through each pattern you're searching for for every file. It would be
horribly taxing on disk access.

But if we imagine that all the files are on a ramdisk, then could the
first method possibly be better? Something in the back of my mind
says method 2 is still better.
 
Lance Pollard

Thanks a lot Robert for your help.
Which Tokenizer are you referring to? Can you give more detail about
your patterns or the kind of replacement you want to do?

I would like to be able to do code generation in a kind of preprocessor
way for ActionScript, but I don't want to use a preprocessor because it
seems like too much work (especially in ActionScript or Java), and I'd
have to do pattern matching anyway :).

For example, I'd like to convert this:

[Bindable]
public function get myProperty():String {
return _myProperty;
}
public function set myProperty(value:String):void {
_myProperty = value;
}

... or this:

[Bindable] public var myProperty:String;

... or anything in between (different formatting...), to this:

[Bindable(event="myPropertyChange")]
public function get myProperty():String {
return _myProperty;
}
public function set myProperty(value:String):void {
_myProperty = value;
dispatchEvent(new Event("myPropertyChange"));
}

That little accessor snippet can be anywhere from 1 to 20+ lines, and I'd
like to be able to just add that line in there without having to run
through the file a bunch of times.

There are a few other similar examples that I'd like to do too, but
that's the gist of it.

Lance
 
Robert Klemme

2009/9/1 spiralofhope said:
Robert is right from a hard drive perspective.

To understand why method 2 works well - just remember that when a file
is read from your disk, it is cached. Your first solution would force
the system to cache file 1, process it, then cache file 2 to process
it.. through to caching file n and then back to file 1 again, cycling
through each pattern you're searching for for every file. It would be
horribly taxing on disk access.

Exactly!

But if we imagine that all the files are on a ramdisk, then could the
first method possibly be better? Something in the back of my mind
says method 2 is still better.

Yes, and here's why: with method 2 you transfer each file's data from
the disk into Ruby's address space only once instead of once per
pattern (memory mapped IO is OS dependent and might not be available;
also, I believe it's not a core lib functionality). Additionally, it is
unlikely that all files reside on a ramdisk regularly, so you have the
additional effort of moving / copying files there.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Robert Klemme

2009/9/1 Lance Pollard said:
Thanks a lot Robert for your help.
Which Tokenizer are you referring to? Can you give more detail about
your patterns or the kind of replacement you want to do?

I would like to be able to do code generation in a kind of preprocessor
way for ActionScript, but I don't want to use a preprocessor because it
seems like too much work (especially in ActionScript or Java), and I'd
have to do pattern matching anyway :).

For example, I'd like to convert this:

[Bindable]
public function get myProperty():String {
    return _myProperty;
}
public function set myProperty(value:String):void {
    _myProperty = value;
}

... or this:

[Bindable] public var myProperty:String;

... or anything in between (different formatting...), to this:

[Bindable(event="myPropertyChange")]
public function get myProperty():String {
    return _myProperty;
}
public function set myProperty(value:String):void {
    _myProperty = value;
    dispatchEvent(new Event("myPropertyChange"));
}

That little accessor snippet can be anywhere from 1 to 20+ lines, and I'd
like to be able to just add that line in there without having to run
through the file a bunch of times.

There are a few other similar examples that I'd like to do too, but
that's the gist of it.

Ah, I see. I assume your files are not big (looks like program code
which typically does not come in GB's.) You might be able to do with
single line regular expressions but I'd first try String#gsub! because
that's typically the most efficient thing to do. Maybe you can even
do this without the block form, e.g.

# incomplete
contents.gsub! %r{
(
\[Bindable)(\]
\s*
public
\s+
function
\s+
get
\s+)
(\w+) # property name $3
\s*
\([^)]*\)
....
}x, '\\1(event="\\3Change")\\2\\3...'
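
One way the idea above could be filled in is sketched below. It is not the code from this post: it uses the block form of `gsub` with named captures (Ruby 1.9 or later) so that the `dispatchEvent` line can be added to the setter body. The names `BINDABLE_ACCESSOR` and `add_change_event` are hypothetical. The sketch assumes that the getter and setter bodies contain no nested braces and that the getter comes first. It does not handle the `[Bindable] public var` form.

```ruby
# Matches a plain [Bindable] tag followed by a getter and a setter for the
# same property. Assumption: neither body contains a nested brace.
BINDABLE_ACCESSOR = /
  \[Bindable\]                          # metadata tag without an event
  (?<getter> \s* public \s+ function \s+ get \s+
    (?<name> \w+ )                      # property name
    \s* \( [^)]* \) [^{]* \{ [^}]* \} ) # arguments, return type, body
  (?<setter_head> \s* public \s+ function \s+ set \s+ \k<name>
    \s* \( [^)]* \) [^{]* \{ )          # setter up to its opening brace
  (?<indent> \s* )                      # line break plus indentation
  (?<body> [^}]*? )                     # existing setter statements
  (?<tail> \s* \} )                     # closing brace of the setter
/x

# Returns a copy of source with the change event added to every match.
def add_change_event(source)
  source.gsub(BINDABLE_ACCESSOR) do
    m = Regexp.last_match
    event = "#{m[:name]}Change"
    "[Bindable(event=\"#{event}\")]" +
      m[:getter] + m[:setter_head] + m[:indent] + m[:body] +
      m[:indent] + "dispatchEvent(new Event(\"#{event}\"));" + m[:tail]
  end
end
```

Accessors that already carry `[Bindable(event=...)]` do not match the pattern, so running the replacement a second time should leave them unchanged.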

Kind regards

robert

 
Lance Pollard

# incomplete
contents.gsub! %r{
(
\[Bindable)(\]
\s*
public
\s+
function
\s+
get
\s+)
(\w+) # property name $3
\s*
\([^)]*\)
....
}x, '\\1(event="\\3Change")\\2\\3...'

Kind regards

robert

Nice! I'll try that out. That looks really clean. Thanks a lot for your
help, guys.

Lance.
 
Robert Klemme

2009/9/2 Shot (Piotr Szotkowski) said:
Lance Pollard:



The only advantage of 1) over 2) is that if you have a situation where
a combination of pattern 18 with file X breaks (because the pattern
didn't foresee some peculiar corner case which manifests only in file
X), then - assuming you take snapshots of the operation after every
pattern replace is over - the first approach will let you keep the work
done by the first 17 pattern replacements and re-run the process from
pattern 18 on, while the second approach will mean you have to roll back
everything and redo all of the work that succeeded one time already.

Excellent point! With a complex operation like this I would keep
backups of all original files anyway. Maybe even copy the whole
directory tree before changing anything. That way you can check with
diff what changed, roll back etc.
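
The backup step could look something like the sketch below. The helper name `backup_tree` and the `.orig` suffix are arbitrary choices for illustration. `FileUtils.cp_r` is the standard library call for a recursive copy.

```ruby
require "fileutils"

# Copy the whole tree to a sibling directory before rewriting anything.
# Refuses to overwrite an existing backup so the originals stay safe.
def backup_tree(root, suffix = ".orig")
  backup = root.chomp("/") + suffix
  raise ArgumentError, "backup already exists: #{backup}" if File.exist?(backup)
  FileUtils.cp_r(root, backup)
  backup
end
```

After the replacements have run on `src`, something like `diff -ru src.orig src` would show what changed, and restoring is a matter of copying the backup over the modified tree.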

Kind regards

robert

 
