Reading a whole file into memory: parsing a C-like file efficiently

n_macpherson

I know there are a number of FAQs which discourage reading whole
files into memory rather than line by line.

However my problem is as follows.

I am reading a file written in a language which looks like (but isn't)
C. I need to insert comments / documentation at various points in the
file. However sometimes I don't know what I want to insert until I get
well past the current line - for example


for(i=0;i<64;i++)
{
// lots of code
}

Say my opening brace is on line 95 and my closing brace is on line 195.
I want to insert a comment

// for loop ends line 195

at line 94 (i.e. immediately above the opening brace). The problem is
that, processing line by line, I don't know until I get to line 195 what
I have to change at line 94, so I have to store lines 94 to 195 in
memory anyway.

Similarly, if I read a function header, I want to insert some
documentation before it, so I don't believe processing the file line
by line is the best solution here. As I will be inserting extra lines
into the middle of an array, I think I am going to need a module to do
this.

Memory won't be an issue - my largest file will only be 6000 lines.

I've been away from Perl for a while but I seem to remember there was
a module File::Tie which might be suitable.

I'd be grateful if anyone has any suggestions - the people who will be
using this don't normally use Perl so I'd like to avoid using any non-
standard modules if possible

Thanks

Niall
 
Jürgen Exner

Similarly if I read a function header, I want to insert some
documentation before the function header
so I don't believe processing the file line by line is the best
solution here.

Based on what you said I would tend to agree.

Whether that kind of automated annotation is useful is a different
story, though. I doubt it. Take this, for example:
Say my opening brace is on line 95 and my closing brace 195 I want to
insert a comment
// for loop ends line 195

First of all a proper indentation will provide even better guidance as
to where the loop ends. And second a single block spanning 100 lines is
just plain nuts. A classic rule of thumb used to be that if the code for
a sub doesn't fit on a VT220 screen, then it is too long and you should
think about splitting it. There were two reasons for this:
- you don't want to keep scrolling up and down while thinking about this
sub
- anything much longer becomes too complex for a single sub

Granted, times have changed and typically you can display many more
lines on modern terminals. But the second reason is still very sound.
Many people will probably consider 30-50 lines of code to be the maximum
length of code that can still be easily viewed and recognized without
too much mental scrolling.
As I will be inserting extra lines into the middle of
an array I think I am going to need a module to do this.

Why? Sounds like a perfect job for splice().
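
A minimal sketch of what that could look like, using only core Perl. The sample lines, the variable names and the assumption that each brace sits at the start of its own line are illustrative, not taken from the original scripts:

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice @lines would come from the file.
my @lines = (
    "for(i=0;i<64;i++)\n",
    "{\n",
    "// lots of code\n",
    "}\n",
);

# Locate the opening and closing brace (indices are 0-based).
my ($open)  = grep { $lines[$_] =~ /^\s*\{/ } 0 .. $#lines;
my ($close) = grep { $lines[$_] =~ /^\s*\}/ } 0 .. $#lines;

# The closing brace moves down one line once the comment is inserted,
# so its final 1-based line number is $close + 2.
my $end_line = $close + 2;

# splice ARRAY, OFFSET, LENGTH, LIST: a LENGTH of 0 removes nothing
# and inserts LIST before the element currently at OFFSET.
splice @lines, $open, 0, "// for loop ends line $end_line\n";

print @lines;
```

With a LENGTH of 0, splice() only inserts, so the comment lands immediately above the opening brace and everything after it shifts down by one element.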

jue
 
n_macpherson

First of all a proper indentation will provide even better guidance as
to where the loop ends. And second a single block spanning 100 lines is
just plain nuts. A classic rule of thumb used to be that if the code for
a sub doesn't fit on a VT220 screen, then it is too long and you should
think about splitting it. There were two reasons for this:
- you don't want to keep scrolling up and down while thinking about this
sub
- anything much longer becomes too complex for a single sub

Granted, times have changed and typically you can display many more
lines on modern terminals. But the second reason is still very sound.
Many people will probably consider 30-50 lines of code to be the maximum
length of code that can still be easily viewed and recognized without
too much mental scrolling.

One of the reasons I am writing this script is because we have
introduced coding standards which specify a maximum of 300 lines per
function and 70 lines for a while/if/else/for block, and I need to
highlight places in our scripts where these limits are exceeded. I agree 300 lines
for a function is probably too long but in the language concerned
anything less than 200 would be completely impractical unfortunately.

The indentation is a good point - our developers mostly develop on
site which means a variety of editors ( UltraEdit, Visual Studio,
Notepad++, our own proprietary editor ) are used. This means
indentation across scripts becomes inconsistent. One of the functions
of the script I am writing will be to make sure the indentation
conforms to the coding standards.
Why? Sounds like a perfect job for splice().

Yes - I'd forgotten that splice() will allow me to insert into the
middle of an array (as I said, I have been away from Perl for a little
while). That should work fine for my purposes.
 
xhoster

I know there are a number of FAQs which discourage reading whole
files into memory rather than line by line.

I hope they discourage you from reading whole files into memory
thoughtlessly and without good reason. It seems like you do have a good
reason to read them into memory, so go ahead and do it. There is even a
module, File::Slurp, to facilitate it.

....
Memory won't be an issue - my largest file will only be 6000 lines.

Those are famous last words :)

I remember many times when I've said "it will only ever be X large" and
then had to eat those words. But of course, I suspect there are many many
more times that my statement held true and it never did get much larger,
but those ones don't force themselves back into your attention the way the
other ones do.
I've been away from Perl for a while but I seem to remember there was
a module File::Tie which might be suitable.

For 6000 lines of code, you should be a long long way from needing
Tie::File. In fact, last time I investigated it, the memory overhead for
Tie::File was so large that, unless your file's lines are very long, much
longer than one generally finds in a computer program, it provided little
memory benefit over slurping the file.
I'd be grateful if anyone has any suggestions -

Don't worry about this particular problem until it has proven itself
to be an issue (which it probably won't)

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
Ben Morrow

Quoth (e-mail address removed):
(e-mail address removed) wrote:
[slurping a file into an array]
For 6000 lines of code, you should be a long long way from needing
Tie::File. In fact, last time I investigated it, the memory overhead for
Tie::File was so large that, unless your file's lines are very long, much
longer than one generally finds in a computer program, it provided little
memory benefit over slurping the file.

One major advantage of Tie::File is that the interface is exactly the
same as a slurped array, so if/when memory does become a problem, you
can simply replace

use File::Slurp qw/read_file/;

my @data = read_file 'name';

with

use Tie::File;

tie my @data, 'Tie::File', 'name' or die "can't read 'name': $!";

and leave the rest of the code unchanged.
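
As a rough illustration of that swap, here is a sketch using only core modules (Tie::File and File::Temp). The scratch file and its contents are hypothetical. One difference worth checking in real code: Tie::File removes the record separator from each element by default, whereas lines slurped with read_file or <$fh> keep their trailing newline.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical scratch file so the sketch is self-contained.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} "for(i=0;i<64;i++)\n{\n// lots of code\n}\n";
close $fh or die "can't close '$filename': $!";

tie my @data, 'Tie::File', $filename
    or die "can't tie '$filename': $!";

# Same splice call as on an ordinary array. Elements carry no newline
# because Tie::File chomps records by default.
splice @data, 1, 0, '// for loop ends line 5';

untie @data;    # release the file

# Read the file back to confirm the change reached the disk.
open my $in, '<', $filename or die "can't read '$filename': $!";
my @result = <$in>;
close $in;
print @result;
```

Changes to the tied array are written to the file itself, so there is no separate "write the array back out" step as there would be with a slurped array.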

Ben
 
xhoster

Ben Morrow said:
Quoth (e-mail address removed):
(e-mail address removed) wrote:
[slurping a file into an array]
For 6000 lines of code, you should be a long long way from needing
Tie::File. In fact, last time I investigated it, the memory overhead
for Tie::File was so large that, unless your file's lines are very
long, much longer than one generally finds in a computer program, it
provided little memory benefit over slurping the file.

One major advantage of Tie::File is that the interface is exactly the
same as a slurped array, so if/when memory does become a problem, you
can simply replace

use File::Slurp qw/read_file/;

my @data = read_file 'name';

This uses 3 times as much memory as reading in the file in a while loop
and pushing it into the array. It seems like it should only be two times
as much, but it isn't (and it is 1.5 times as much as @data=<$fh> takes). Of
course, most of that excess memory is eligible for later reuse, provided
your program survives and needs it.
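
For reference, the two core-Perl alternatives mentioned above look roughly like this. The sketch only shows the idioms; it does not measure memory, and the in-memory sample text is a hypothetical stand-in for a real file:

```perl
use strict;
use warnings;

# Hypothetical in-memory "file" so the sketch runs without a real path;
# opening a filehandle on a scalar reference is core Perl.
my $source = "for(i=0;i<64;i++)\n{\n// lots of code\n}\n";

# 1. Read line by line and push onto the array.
open my $fh, '<', \$source or die "can't open: $!";
my @pushed;
while ( my $line = <$fh> ) {
    push @pushed, $line;
}
close $fh;

# 2. Read every line at once in list context.
open $fh, '<', \$source or die "can't open: $!";
my @listed = <$fh>;
close $fh;

print scalar(@pushed), " lines either way\n";
```

Both forms leave the trailing newline on every element and need no non-core modules.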
with

use Tie::File;

tie my @data, 'Tie::File', 'name' or die "can't read 'name': $!";

and leave the rest of the code unchanged.

But my lament is that this just doesn't save all that much memory over
an already efficient slurping method, due to the overhead of Tie::File's
internal structures. I checked again on the latest Tie::File, and based on
vague recollections it does seem substantially better than the older one I
played around with, but still the memory overhead is not an insignificant
fraction of what it would be to just slurp a large file of short lines. So
I consider Tie::File to be an emergency measure I'd throw at a program to
keep it limping along while I redesign and rewrite. (Not that there is
anything wrong with that)

Xho

 
cartercc

Say my opening brace is on line 95 and my closing brace 195 I want to
insert a comment

// for loop ends line 195

at line 94 (i.e. immediately above the opening brace). The problem is
that, processing line by line, I don't know until I get to line 195 what
I have to change at line 94, so I have to store lines 94 to 195 in
memory anyway.

Similarly if I read a function header, I want to insert some
documentation before the function header
so I don't believe processing the file line by line is the best
solution here. As I will be inserting extra lines into the middle of
an array I think I am going to need a module to do this.

I might approach this by matching delimiters. You can certainly match
delimiters and insert comments just above the opening brace. If you
match on key words (for, while, if, else, etc.) and count your lines,
you can create an intermediate file with a comment template just above
the opening brace, and then manually edit for the final program.
Something like this, maybe:

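use strict;
use warnings;

# INFILE and OUTFILE are assumed to have been opened already.
my @brace_stack;    # line numbers of the braces that are still open
while ( my $line = <INFILE> ) {
    if ( $line =~ /\{/ ) {
        push @brace_stack, $.;           # remember where this block opened
        print OUTFILE "// COMMENT\n";    # comment template above the brace
        print OUTFILE $line;
    }
    elsif ( $line =~ /\}/ ) {
        my $opened_at = pop @brace_stack;    # line of the matching '{'
        print OUTFILE $line;
        print OUTFILE "// COMMENT\n";        # comment template below the brace
    }
    else {
        print OUTFILE $line;                 # pass everything else through
    }
}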

Obviously, your logic would depend on your coding standard. I wrote
something similar in Java, built around a class that did this kind of
delimiter matching. Perl ought to be a lot easier.
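
Combining the brace stack with a slurped array and splice() might look like the sketch below. It is only an illustration under some loud assumptions: the sample input is made up, braces are assumed to sit at the start of their own lines, and braces inside strings or comments are not handled.

```perl
use strict;
use warnings;

# Hypothetical sample input with one nested block.
my @lines = (
    "for(i=0;i<64;i++)\n",
    "{\n",
    "if(i>0)\n",
    "{\n",
    "// lots of code\n",
    "}\n",
    "}\n",
);

my @open_stack;    # 0-based indices of braces still waiting for a match
my @pairs;         # [open index, close index] for every matched block

for my $i ( 0 .. $#lines ) {
    if ( $lines[$i] =~ /^\s*\{/ ) {
        push @open_stack, $i;
    }
    elsif ( $lines[$i] =~ /^\s*\}/ ) {
        my $open = pop @open_stack;
        die "unmatched '}' at line " . ( $i + 1 ) . "\n" unless defined $open;
        push @pairs, [ $open, $i ];
    }
}
die "unmatched '{' at line " . ( $open_stack[-1] + 1 ) . "\n" if @open_stack;

# Insert from the bottom up so earlier indices stay valid. Line numbers
# refer to the original file, before any comments were added.
for my $pair ( sort { $b->[0] <=> $a->[0] } @pairs ) {
    my ( $open, $close ) = @$pair;
    splice @lines, $open, 0, "// block ends line " . ( $close + 1 ) . "\n";
}

print @lines;
```

Working from the last opening brace back to the first means each insertion only shifts lines that have already been dealt with. The same @pairs list could also be used to flag blocks whose length exceeds a coding-standard limit.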

CC
 