Editing files with File::Find

jl_post

Dear Perl community,

A handful of times I've had to edit certain files in a given
directory. For example, I might have to append a blank line to all
*.txt files located in a certain directory (and all its
subdirectories).

For this task, I use the File::Find module, like this:

use File::Find;
find(\&wanted, @directories_to_search);
sub wanted { ... }

As for the wanted function, I could define it like this:

sub wanted
{
    # Skip non-text files:
    return unless -f and m/\.txt$/;

    # Append a newline to text file:
    open(OUT, ">>$_") or die "Error with '$_': $!";
    print OUT "\n";
    close(OUT);
}

or I could define it like this:

sub wanted
{
    # Skip non-text files:
    return unless -f and m/\.txt$/;

    # Rename file to *.bak:
    rename($_, "$_.bak") or die "Cannot rename '$_': $!";

    # Read *.bak file into *.txt file and add newline:
    open(IN, "<$_.bak") or die "Error reading '$_': $!";
    open(OUT, ">$_") or die "Error writing to '$_': $!";
    print OUT <IN>, "\n";
    close(OUT);
    close(IN);

    unlink("$_.bak") or die "Cannot unlink '$_': $!";
}


In the first definition of &wanted, I simply append to the file
without creating a back-up file. In the second version of &wanted, I
create a back-up file that ends in *.bak.
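
(As a side note, and purely as a matter of style rather than anything
File::Find requires, the first version could also be written with lexical
filehandles and three-argument open. Here is a sketch; the use of @ARGV for
the search directories is just my choice for the example:)

use strict;
use warnings;
use File::Find;

my @directories_to_search = @ARGV;   # e.g. the directories given on the command line

sub wanted
{
    # Skip anything that isn't a plain file ending in .txt:
    return unless -f and m/\.txt$/;

    # find() chdir()s into each directory by default, so $_ here is
    # just the basename.  Append a single newline:
    open(my $out, '>>', $_) or die "Error with '$_': $!";
    print $out "\n";
    close($out) or die "Error closing '$_': $!";
}

find(\&wanted, @directories_to_search);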

My main question is this: Since both versions of &wanted modify the
files they find, will they ruin (or throw off) the File::Find
algorithm?

In the first definition of &wanted, no extra files were created, yet
one was modified. Will this cause File::Find to "pick up" the modified
file as a new file and modify it again?

In the second definition of &wanted, an extra file was created for
every file modified. The first file was not deleted, but it was
entirely overwritten.

Does anyone have any input on this? I read through "perldoc
File::Find" for mention of "re-picking-up" file entries that were
already modified, but I couldn't find any.

I'm wondering if this is a machine-specific issue. If it is, it
would probably be safest if I wrote the &wanted function like this:

my @files;

sub wanted
{
    # Skip non-text files:
    return unless -f and m/\.txt$/;

    push @files, $File::Find::name;
}

and then loop through the @files array, like this:

foreach (@files)
{
    # Append a newline to text file:
    open(OUT, ">>$_") or die "Error with '$_': $!";
    print OUT "\n";
    close(OUT);
}

That way I won't modify anything until I've found all the files to
process.
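
Put together, that two-pass idea would look something like the sketch
below. (I've used find()'s no_chdir option here just so the names saved in
the first pass are full paths that are still valid in the second loop;
that's one way to do it, not the only way.)

use strict;
use warnings;
use File::Find;

my @directories_to_search = @ARGV;
my @files;

# First pass: collect the names; nothing is modified during the walk.
find({
    no_chdir => 1,    # $_ is the full path (same as $File::Find::name)
    wanted   => sub {
        return unless -f and m/\.txt$/;
        push @files, $_;
    },
}, @directories_to_search);

# Second pass: edit the files after the directory walk has finished.
foreach my $file (@files)
{
    open(my $out, '>>', $file) or die "Error with '$file': $!";
    print $out "\n";
    close($out);
}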

So should I stick to the last method I mentioned, or am I worrying
for nothing? (Or is there a better method I'm not aware of yet?)

Thanks for any input.

-- Jean-Luc
 
Darren Dunham

> In the first definition of &wanted, I simply append to the file
> without creating a back-up file. In the second version of &wanted, I
> create a back-up file that ends in *.bak.
>
> My main question is this: Since both versions of &wanted modify the
> files they find, will they ruin (or throw off) the File::Find
> algorithm?
>
> In the first definition of &wanted, no extra files were created, yet
> one was modified. Will this cause File::Find to "pick up" the modified
> file as a new file and modify it again?

No. That's completely safe.

> In the second definition of &wanted, an extra file was created for
> every file modified. The first file was not deleted, but it was
> entirely overwritten.

Probably depends on the filesystem. In my tests, the rename step always
moved the file to a new position in the directory, so when the new file
was created, it was always created prior to the point File::Find had
already reached. So the new file was never scanned.

I don't know if I'd really want to rely on that behavior though.
> I'm wondering if this is a machine-specific issue. If it is, it
> would probably be safest if I wrote the &wanted function like this:
>
> my @files;
>
> sub wanted
> {
>     # Skip non-text files:
>     return unless -f and m/\.txt$/;
>
>     push @files, $File::Find::name;
> }
>
> and then loop through the @files array, like this:
>
> foreach (@files)
> {
>     # Append a newline to text file:
>     open(OUT, ">>$_") or die "Error with '$_': $!";
>     print OUT "\n";
>     close(OUT);
> }
>
> That way I won't modify anything until I've found all the files to
> process.

That would also be good (and there might be times that I would prefer
it). But if you can use the append method (which is much simpler and
faster anyway), then you're not modifying filenames or the directory.
File::Find won't be confused.

Here's a question, what happens if your program or machine crashes in
the middle of changing a particular directory? You'll have done some
files but not others. Would you want to just append another newline
onto files that have already been processed? It might be best to check
the end of the file and see if a newline is present, and add one only if
it's not. That would get around any crash problems as well as a
situation where File::Find might hand you a file twice due to a
directory reorganization or somesuch.
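
A sketch of that check, assuming the same append-a-newline edit as above
(the seek/read dance is just one way to look at the last byte):

use Fcntl qw(SEEK_END);

sub wanted
{
    return unless -f and m/\.txt$/;

    # Look at the last byte of the file (if it has one):
    open(my $fh, '<', $_) or die "Error reading '$_': $!";
    my $last = '';
    if (-s $fh) {
        seek($fh, -1, SEEK_END) or die "Cannot seek in '$_': $!";
        read($fh, $last, 1);
    }
    close($fh);

    # Already ends in a newline: nothing to do, so rerunning the
    # script (or being handed the file twice) is harmless.
    return if $last eq "\n";

    open(my $out, '>>', $_) or die "Error with '$_': $!";
    print $out "\n";
    close($out);
}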
 
jl_post

Thanks for the quick reply, Darren!

Darren said:
> No. That's completely safe.

Out of curiosity, how do you know? Did you read it in some
documentation, or do you know how File::Find works internally? If you
read about it in some perldoc somewhere, I'd like to read it, too.
(Hopefully it can clear up more on the subject for me.)
> Probably depends on the filesystem. In my tests, the
> rename step always moved the file to a new position in
> the directory, so when the new file was created, it was
> always created prior to the point File::Find had
> already reached. So the new file was never scanned.
>
> I don't know if I'd really want to rely on that behavior
> though.


I know what you mean about not relying on that behavior. This
brings up another method I didn't mention in the first post, and
that's to write a wanted function that accepts every text file and
creates a *.bak file (that won't get deleted), like this:

sub wanted
{
    # Skip non-text files:
    return unless -T;

    # Rename file to *.bak:
    rename($_, "$_.bak") or die "Cannot rename '$_': $!";

    # Read *.bak file into original file and add newline:
    open(IN, "<$_.bak") or die "Error reading '$_': $!";
    open(OUT, ">$_") or die "Error writing to '$_': $!";
    print OUT <IN>, "\n";
    close(OUT);
    close(IN);
}


This &wanted function will process every text file (whether it ends
in '.txt' or not) and create a corresponding *.bak file (which it never
deletes, in case the changes later need to be undone).

What I worry about is that the &wanted function may find one of the
*.bak files and process it, creating a *.bak.bak file which it will
find and process and then create a *.bak.bak.bak file, and so on...

According to your tests, the rename step always moved the file to a
new position in the directory, so that when the new file was created, it
was always created prior to the point File::Find had already
reached. Assuming that behavior is consistent across all platforms,
the above solution would work, as it won't read any of the
newly-made *.bak files. But like you said, we might not want to rely
on that behavior...
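
An easier guard, which doesn't depend on directory order at all, would be
to have &wanted simply refuse to touch anything that already ends in .bak.
Just a sketch, with the rest of the routine as before:

sub wanted
{
    # Never touch the backups this script creates, so a re-scan (or a
    # second run) can't produce *.bak.bak files:
    return if m/\.bak$/;

    # Skip non-text files:
    return unless -f and -T;

    # Rename file to *.bak:
    rename($_, "$_.bak") or die "Cannot rename '$_': $!";

    # Read the *.bak file back into the original name and add a newline:
    open(my $in,  '<', "$_.bak") or die "Error reading '$_.bak': $!";
    open(my $out, '>', $_)       or die "Error writing to '$_': $!";
    print {$out} <$in>, "\n";
    close($out);
    close($in);
}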

> Here's a question, what happens if your program or
> machine crashes in the middle of changing a particular
> directory? You'll have done some files but not others.
> Would you want to just append another newline onto
> files that have already been processed? It might be
> best to check the end of the file and see if a newline
> is present, and add one only if it's not. That would
> get around any crash problems as well as a situation
> where File::Find might hand you a file twice due to a
> directory reorganization or somesuch.

You bring up a good point! I have to admit, though, that the
example I gave was hypothetical, as most of the changes I've had to do
involve looking for specific patterns of strings and replacing them
with similar but different strings (ones that don't show up in the
original files). I chose the task of appending a blank line for the
sake of example, since I didn't want to clutter the code with lots of
regular expressions.

But you put forward the excellent suggestion of only making a change
if it's needed, so as not to re-edit a file if the script is re-run on
files that have already been processed.
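
To make that concrete for the pattern-replacing case: since the
replacement text never appears in the original files, a substitution-based
&wanted is naturally idempotent. Here's a sketch; OLD_STRING and NEW_STRING
are made up purely for illustration:

sub wanted
{
    return unless -f and m/\.txt$/;

    # Slurp the whole file:
    open(my $in, '<', $_) or die "Error reading '$_': $!";
    my $text = do { local $/; <$in> };
    close($in);
    $text = '' unless defined $text;   # an empty file reads as undef

    # Hypothetical replacement; s///g returns the number of changes,
    # and NEW_STRING never matches OLD_STRING, so a second run is a no-op:
    my $changes = ($text =~ s/OLD_STRING/NEW_STRING/g);

    return unless $changes;   # nothing to do, so leave the file untouched

    open(my $out, '>', $_) or die "Error writing to '$_': $!";
    print $out $text;
    close($out);
}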

Thanks again, Darren.

-- Jean-Luc
 
Darren Dunham

> Out of curiosity, how do you know? Did you read it in some
> documentation, or do you know how File::Find works internally? If you
> read about it in some perldoc somewhere, I'd like to read it, too.
> (Hopefully it can clear up more on the subject for me.)

In the past I have read a bit of File::Find to add some functionality,
but no, I'm just relying on my knowledge of how filesystems work and are
accessed in general.

It's going to be doing getdents() calls (via opendir/readdir) and in
some circumstances calling stat() on files within. To do anything else
would be harder and less portable. Opening a file for appending and
writing to it isn't going to change the output there in any way that
File::Find will notice (it would of course change size and some
timestamps).

A quick peek shows those calls in the module in a form that looks
reasonably like what I expect.

As for my suspicion of the other form of your subroutine, my concern
wasn't with the Find module itself, but with how the core perl readdir()
would handle returning data about a directory while it was being updated.
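
Very roughly (and this is only a toy illustration, not File::Find's actual
code), that kind of walk boils down to the following; the readdir() loop is
exactly the place where renaming or creating entries mid-scan could matter:

sub walk
{
    my ($dir) = @_;

    opendir(my $dh, $dir) or die "Cannot opendir '$dir': $!";
    while (defined(my $entry = readdir($dh))) {
        next if $entry eq '.' or $entry eq '..';
        my $path = "$dir/$entry";

        if (-d $path) {
            walk($path);       # recurse into subdirectories
        }
        else {
            # A real wanted() callback would run here.  Appending data
            # to $path doesn't disturb this loop, but renaming or
            # creating entries in $dir while it runs might.
            print "$path\n";
        }
    }
    closedir($dh);
}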
> I know what you mean about not relying on that behavior. This
> brings up another method I didn't mention in the first post, and
> that's to write a wanted function that accepts every text file and
> creates a *.bak file (that won't get deleted), like this:
>
> sub wanted
> {
>     # Skip non-text files:
>     return unless -T;
>
>     # Rename file to *.bak:
>     rename($_, "$_.bak") or die "Cannot rename '$_': $!";
>
>     # Read *.bak file into original file and add newline:
>     open(IN, "<$_.bak") or die "Error reading '$_': $!";
>     open(OUT, ">$_") or die "Error writing to '$_': $!";
>     print OUT <IN>, "\n";
>     close(OUT);
>     close(IN);
> }
>
> This &wanted function will process every text file (whether it ends
> in '.txt' or not) and create a corresponding *.bak file (which it never
> deletes, in case the changes later need to be undone).
>
> What I worry about is that the &wanted function may find one of the
> *.bak files and process it, creating a *.bak.bak file which it will
> find and process and then create a *.bak.bak.bak file, and so on...

Yup.

> According to your tests, the rename step always moved the file to a
> new position in the directory, so that when the new file was created, it
> was always created prior to the point File::Find had already
> reached. Assuming that behavior is consistent across all platforms,
> the above solution would work, as it won't read any of the
> newly-made *.bak files. But like you said, we might not want to rely
> on that behavior...

Right. I did one test on one version of Solaris/UFS. Some other
filesystem could "optimize" the rename so that a new slot was not taken,
and new files would be added later.
>> Here's a question, what happens if your program or
>> machine crashes in the middle of changing a particular
>> directory? You'll have done some files but not others.
>> Would you want to just append another newline onto
>> files that have already been processed?
>> [...]
>
> You bring up a good point! I have to admit, though, that the
> example I gave was hypothetical, as most of the changes I've had to do
> involve looking for specific patterns of strings and replacing them
> with similar but different strings (ones that don't show up in the
> original files). I chose the task of appending a blank line for the
> sake of example, since I didn't want to clutter the code with lots of
> regular expressions.

Gotcha. 'idempotent'. As long as you can rerun it on the same file,
then that's much safer.
 
jl_post

Darren said:
> Gotcha. 'idempotent'. As long as you can rerun it
> on the same file, then that's much safer.


Okay, just to make sure I understand correctly, I'll give an
"idempotent" example: Replacing all text files with 1000 blank lines:


use File::Find;
find(\&wanted, @directories_to_search);

sub wanted
{
    return unless -T; # skip non-text files

    # Replace text file with 1000 blank lines:
    open(OUT, ">$_") or die "Error with '$_': $!";
    print OUT ("\n" x 1000);
    close(OUT);
}


This code will modify every text file, replacing its contents with nothing
but 1000 newlines. It may decrease or increase the size of the file.
Just to make sure I understand correctly, am I correct in saying that
using this approach is safe in that modified files won't be picked up
again (and so won't result in an infinite loop)?

Thanks for the help you've already given me, by the way.

-- Jean-Luc
 
Darren Dunham

> Okay, just to make sure I understand correctly, I'll give an
> "idempotent" example: replacing all text files with 1000 blank lines:
>
> use File::Find;
> find(\&wanted, @directories_to_search);
>
> sub wanted
> {
>     return unless -T; # skip non-text files
>
>     # Replace text file with 1000 blank lines:
>     open(OUT, ">$_") or die "Error with '$_': $!";
>     print OUT ("\n" x 1000);
>     close(OUT);
> }

> This code will modify every text file, replacing its contents with nothing
> but 1000 newlines. It may decrease or increase the size of the file.

Right. And running it multiple times on a file will not change the
result.
> Just to make sure I understand correctly, am I correct in saying that
> using this approach is safe in that modified files won't be picked up
> again (and so won't result in an infinite loop)?

This statement (truncating and writing to an existing file) is slightly
different from the other two you gave earlier (appending to an existing
file and creating a new file). Whether it's safe depends on exactly how
open for writing works in perl.

I see nothing that suggests that it does something weird, so presumably
it's leaving the inodes and directories alone and just truncating the
data. If true, then there's nothing that could confuse the path
File::Find takes. Seems safe to me.
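
If you want to convince yourself on a particular system, one quick (if
unscientific) check is to compare the inode number before and after the
truncating open. The scratch filename here is made up for the test:

use strict;
use warnings;

my $file = 'inode_test.txt';    # hypothetical scratch file

# Create the file with some initial contents:
open(my $fh, '>', $file) or die "Cannot create '$file': $!";
print $fh "some text\n";
close($fh);

my $inode_before = (stat($file))[1];

# Re-open for writing: this truncates the existing file in place...
open($fh, '>', $file) or die "Cannot truncate '$file': $!";
print $fh "\n" x 1000;
close($fh);

my $inode_after = (stat($file))[1];

# ...if the two numbers match, the inode (and the directory entry) was reused.
print "before: $inode_before  after: $inode_after\n";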
 
