Trouble with embedded whitespace in filenames using File::Find


Clint O

I wrote the following program to find duplicate files. The problem is that some of my files have whitespace, and potentially other special characters, in their names:

#!/opt/local/bin/perl

use Digest::MD5;
use File::Find;
use Data::Dumper;

use strict;
use warnings;

my %results = ();

sub do_file;

my @files = @ARGV;

exit 1 if !@files;

find(sub { do_file(\%results) }, @files );

for (keys %results) {
    my @f = @{$results{$_}};

    if (scalar @f > 1) {
        print "$f[0] => $f[1]\n";
    }
}

sub do_file {
    my ($hash) = @_;
    return if -d $_;

    open(my $fh, $_) or die "Can't open '$File::Find::name': $!";
    binmode $fh;

    my $digest;

    $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    push @{$hash->{$digest}}, $File::Find::name;
}

0;

If I create a test directory:

$ mkdir test_dir
$ cd test_dir
$ touch " my file"
$ ./dupcheck testdir
Can't open 'testdir/ my file': No such file or directory at ./dupcheck line 32.

I can't be the first person to run into this problem, and I'm sure there's a reasonable explanation of how to cope with it, but I haven't been able to find anything by searching the web.

Thanks,

-Clint
 

Clint O

You created

  "test_dir/my file"
       ^

and you're trying to open

  "testdir/ my file".
       ^

It's not there, so the program complains.

Well, that "test_dir" is clearly a typo. This program would never have generated that output for a non-existent directory:

$ ./dupcheck /asfasfasdfasdf
Can't stat /asfasfasdfasdf: No such file or directory
at ./dupcheck line 18

Anyway, my issue still stands. I cannot open a local file with embedded whitespace.

Thanks,

-Clint
 

Rainer Weikusat

[...]
The following program I wrote I'm using to find duplicate files. The
problem is that I have files with whitespace or potentially other
special characters:
[...]

open(my $fh, $_) or die "Can't open '$File::Find::name': $!";

Since you didn't specify an explicit open mode, perl parses $_ in
order to look for one, and it skips leading whitespace; cf.

The filename passed to 2-argument (or 1-argument) form of
open() will have leading and trailing whitespace deleted, and
the normal redirection characters honored.
[perldoc -f open]

Using open($fh, '<', $_) instead works.
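
Applied to the script above, do_file with the three-argument open might look roughly like this (a sketch only, keeping the original variable names):

sub do_file {
    my ($hash) = @_;
    return if -d $_;

    # Three-argument open: the explicit '<' mode means $_ is taken
    # literally as a file name, leading whitespace and all.
    open(my $fh, '<', $_) or die "Can't open '$File::Find::name': $!";
    binmode $fh;

    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    push @{$hash->{$digest}}, $File::Find::name;
}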

BTW: Assuming you're running this as root, someone who doesn't like
you could create a file named |rm -rf `printf "\x2f"` and you probably
wouldn't like the result of trying to open that.

NB: DO NOT TRY THIS. Except if I made an error, this will execute rm
-rf / with the privileges of the invoker.

More harmless: td/|ls `printf "..\x2f"`. This will list the contents
of the directory above td.
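
In Perl terms, the mechanism can be demonstrated harmlessly with a throwaway example (the file name and command below are made up for illustration): a two-argument open of a string beginning with '|' starts a pipe to a command instead of opening a file.

use strict;
use warnings;

# A perfectly legal file name on most filesystems.
my $name = '|echo this just executed a command';

open(my $fh, $name) or die "open: $!";   # 2-arg open: perl runs echo
close $fh;

# The three-argument form treats the same string purely as a file name,
# so it simply fails here (no file by that name exists).
open(my $fh2, '<', $name) or warn "open: $!";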
 

Clint O

Since you didn't specify an explicit open mode, perl parses $_ in
order to look for one, and it skips leading whitespace; cf.

The filename passed to 2-argument (or 1-argument) form of
open() will have leading and trailing whitespace deleted, and
the normal redirection characters honored.
[perldoc -f open]

Using open($fh, '<', $_) instead works.

BTW: Assuming you're running this as root, someone who doesn't like
you could create a file named |rm -rf `printf "\x2f"` and you probably
wouldn't like the result of trying to open that.

NB: DO NOT TRY THIS. Except if I made an error, this will execute rm
-rf / with the privileges of the invoker.

More harmless: td/|ls `printf "..\x2f"`. This will list the contents
of the directory above td.

Ok, thanks for the tip and the heads-up. I am running the program as root on a NAS, and the files are created by my family, but just as a good FYI: are there ways I can protect myself against malicious code? Running as root ensures I can read all the files without question. I've used Safe before, but I'm not sure whether it's necessary or appropriate for this application.

Thanks,

-Clint
 

Jürgen Exner

Clint O said:
Well, that "test_dir" is clearly a typo.

So, you should be thankful that Clint found that typo and pointed it out
to you, right?
Anyway, my issue still stands. I cannot open a local file with embedded whitespace.

Well, nobody claimed that there is only one issue in your program.

jue
 

Clint O

So, you should be thankful that Clint found that typo and pointed it out
to you, right?

Well, nobody claimed that there is only one issue in your program.

Well, if you're going to critique my program and bother to post a reply, at least make it relevant. People request that you post entire scripts so that the problem can be seen by others. I did due diligence by posting the script and made a mistake in the test case.

-Clint
 

Jürgen Exner

Clint O said:
Well, if you're going to critique my program and bother to post a reply, at least make it relevant. People request that you post entire scripts so that the problem can be seen by others. I did due diligence by posting the script and made a mistake in the testcase.

Ok, because you explicitly asked for it:
- Is there a specific reason why you are adding an empty line after
every line you quote? That doesn't improve readability one bit and makes
quoting your post rather tedious.
- Is there a specific reason why your lines are longer than the usual
70-75 characters?

jue
 

Clint O

Ok, because you explicitly asked for it:
- Is there a specific reason why you are adding an empty line after
every line you quote? That doesn't improve readability one bit and makes
quoting your post rather tedious.
- Is there a specific reason why your lines are longer than the usual
70-75 characters?

I'm guessing these might be artifacts of the Google Groups web interface. That's what I'm using to read the group. It's hard(er) to control the formatting of my responses. Coming from a hard-nosed slrn background, I agree that it is annoying, and if I can figure it out I will fix it.

Thanks,

-Clint
 

Rainer Weikusat

Ben Morrow said:
Quoth Clint O:
Ok, thanks for the tip and the heads-up. I am running the program as
root on a NAS, and the files are created by my family, but just as a
good FYI, are there ways I can protect myself against malicious code?
Running as root ensures I can read all the files w/o question.
[...]

If you must do this as root, I would seriously consider using find(1),
xargs(1) and md5(1) instead, assuming your find and xargs support the
-print0 and -0 arguments. You're much less likely to make a serious
mistake using preexisting utilities than trying to write your own.

Sorry to be so blunt, but this is a really stupid suggestion. It's not
only that a lot of characters valid in filenames are syntactically
relevant to the shell: the shell will also perform multiple passes of
textual substitution on a complete input line and happily execute
whatever the combined result happens to be. IOW, the shell does not
genuinely distinguish between 'script text from a file' and 'text
produced as the result of an operation performed by the script', making
it an extremely poor choice for writing code supposed to run in a
hostile environment. Perl is much better in this respect: not only does
it not execute data 'by default' (just when explicitly asked to), it
can also be made to complain about a lot of potentially unsafe 'data
flows'; see 'Taint mode' in perlsec. These checks can be onerous at
times, but they should catch a lot of accidental errors (such as the
2-arg open of a string which came from the file system).
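
As an illustration of the taint-mode point (a sketch, not the OP's script): under -T, names returned by readdir() are tainted, so the dangerous case above dies with an "Insecure dependency" error instead of executing anything, while ordinary reads still work.

#!/usr/bin/perl -T
use strict;
use warnings;

opendir my $dh, '.' or die "opendir: $!";
for my $name (readdir $dh) {
    next unless -f $name;
    # $name is tainted. Reading a file with a tainted name is allowed,
    # but if $name begins with '|' this two-argument open would become
    # a piped open of tainted data, and perl dies with
    # "Insecure dependency in piped open".
    open my $fh, $name or warn "Can't open '$name': $!";
    close $fh if $fh;
}
closedir $dh;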
 

Mike Scott

On 21/01/13 21:39, Clint O wrote:
.....
Ok, thanks for the tip and the heads-up. I am running the program as
root on a NAS, and the files are created by my family, but just as a
good FYI, are there ways I can protect myself against malicious code?
Running as root ensures I can read all the files w/o question. I've
used Safe before, but I'm not sure whether it's necessary or
appropriate for this application.

If I may ask a naive question.... Why are you writing a duplicate-file
finder from scratch when programs such as fdupes already exist and
presumably have such issues already resolved?

fdupes "searches the given path for duplicate files. Such files are
found by comparing file sizes and MD5 signatures, followed by a
byte-by-byte comparison". That last bit is important.
 

Rainer Weikusat

Mike Scott said:
On 21/01/13 21:39, Clint O wrote:
....

If I may ask a naive question.... Why are you writing a
duplicate-file finder from scratch when programs such as fdupes
already exist and presumably have such issues already resolved?

May I ask you an equally naive question? Why precisely do you think
your statement is even remotely on topic for a Perl newsgroup?
fdupes "searches the given path for duplicate files. Such files are
found by comparing file sizes and MD5 signatures, followed by a
byte-by-byte comparison". That last bit is important.

Indeed. It communicates that the author didn't really think straight:
calculating an MD5 hash of a file requires an expensive processing
operation to be performed for each byte of this file. OTOH, comparing
the content of files of identical sizes (which should already be quite
rare) with each other will usually stop early if the files are not
identical.
 

Mike Scott

May I ask you an equally naive question? Why precisely do you think
your statement is even remotely on topic for a Perl newsgroup?

Because it's answering the issue implicit in the original post, and
may save the OP considerable effort and pain. I entirely agree a
discussion of fdupes itself would be out of place here.
 

Rainer Weikusat

Mike Scott said:
Because it's answering the issue implicit in the original post,

You asserted that you are convinced that a certain program wouldn't
suffer from a certain problem. The question was "Why doesn't
the perl 2-argument open work with filenames containing leading
whitespace?". Even if your beliefs about this program happen to be
correct, voicing them doesn't answer the question.
 

Jim Gibson

Rainer said:
Indeed. It communicates that the author didn't really think straight:
calculating an MD5 hash of a file requires an expensive processing
operation to be performed for each byte of this file. OTOH, comparing
the content of files of identical sizes (which should already be quite
rare) with each other will usually stop early if the files are not
identical.

True enough, but if you have N files and are looking for duplicates
among any pair, it is probably more efficient to compute a checksum for
each of the files, then look for duplicates among the checksums. If the
files are large enough, comparing checksums will be faster than
comparing the files themselves.
 

Willem

Jim Gibson wrote:
) In article <[email protected]>, Rainer
)
)> Indeed. It commmunicates that the author didn't really think straight:
)> Calculating a MD5 hash of a file requires an expensive processing
)> operation to be performed for each byte of this file. OTOH, comparing
)> the content of files of identical sizes (which should already be quite
)> rare) with each other will usually stop early if the files are not
)> identical.
)
) True enough, but if you have N files and are looking for duplicates
) among any pair, it is probably more efficient to compute a checksum for
) each of the files, then look for duplicates among the checksums. If the
) files are large enough, comparing checksums will be faster than
) comparing the files themselves.

That depends. Even assuming the files are all the same size, it's
quite probable that there will be differences in the first block.

I think a good approach is to first group by file size, but read
the first N bytes of each file as well and keep those in memory.
(To take advantage of filesystems that store the first chunk of
a file inside the inode).
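
A rough sketch of that approach (an illustration, not code from the thread; the 8 KB prefix length and all names are arbitrary choices): group candidates by size, then by a short prefix, and only compare or hash whatever survives both filters.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

die "usage: $0 dir ...\n" unless @ARGV;

my $PREFIX_LEN = 8192;   # "first N bytes" -- an arbitrary choice
my %by_size;

# Pass 1: bucket every regular file by its size.
find(sub {
    return unless -f $_;
    my $size = -s _;                  # reuse the stat done by -f
    push @{ $by_size{$size} }, $File::Find::name;
}, @ARGV);

for my $size (keys %by_size) {
    my $group = $by_size{$size};
    next if @$group < 2;              # a unique size can't have duplicates

    # Pass 2: within a size group, bucket by the first few KB.
    my %by_prefix;
    for my $path (@$group) {
        open my $fh, '<', $path or do { warn "Can't open '$path': $!"; next };
        binmode $fh;
        read $fh, my $prefix, $PREFIX_LEN;
        close $fh;
        push @{ $by_prefix{$prefix} }, $path;
    }

    # Files agreeing in both size and prefix are still only candidates;
    # they would need a full comparison (or hash) to confirm.
    for my $candidates (values %by_prefix) {
        print "possible duplicates: @$candidates\n" if @$candidates > 1;
    }
}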


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

Rainer Weikusat

Ben Morrow said:
MD5 is not expensive. It probably takes less time to MD5 two files,
reading each sequentially, than it takes to read alternating blocks from
each file, with the associated disk seeks.

MD5 (or any other hashing algorithm) is a lot more expensive than a
comparison and especially so if MD5 needs to process 2G of data while
the comparison would only need 8K. This means that MD5 will usually
lose if the files are different. And MD5 + byte-by-byte comparison
will usually lose if they aren't. Anything else is a pathological
situation (eg, lots of large files differing in the last few bytes).
 

Charlton Wilbur

RW> MD5 (or any other hashing algorithm) is a lot more expensive
RW> than a comparison and especially so if MD5 needs to process 2G
RW> of data while the comparison would only need 8K.

You make several unfounded assumptions here.

One, that the cost of a single linear read of a file, such as needed to
calculate a file hash, is comparable in expense and time to two or more
interleaved file reads, such as needed to do a direct comparison. Since
the seeking takes more time than the reading -- often by an order of
magnitude -- and the reading takes more time than the calculation --
again, often by an order of magnitude -- it is difficult to support this
claim. Yes, in terms of raw processor time, calculating an MD5 hash on
each of two blocks of memory and then comparing the result is more
expensive than comparing the two blocks of memory, especially if the
comparison can terminate at the first difference; but processor time is
far from the only cost being paid, and in the average case where a
filesystem is involved I expect the tradeoffs to be far less clear.

Two, that the number of comparisons is small. The more comparisons you
have, the more the advantage goes to the hashing algorithm. If you have
2 files, it is best to read the first 8K of each and compare them,
since, as you note, odds are that any differences will appear early on.
If you have 1000 files, reading the first 8K of each file for
comparison purposes means a great deal of seeking and reading; and then
you either store the first 8K, leading to a large working set (and the
first time you swap, you've lost anything you won by avoiding
calculating hashes), or you repeatedly seek and read. MD5 hashes, at 16
bytes each, require a much smaller working set.

Three, that no other caching or optimization is possible. If this task
is done repeatedly, it should be possible to cache the hash values of
the files and compare a timestamp on the hash value to the timestamp on
the file. If two files differ in size, they are clearly not equivalent;
determining the size of the file may be basically free, since the cost
for the call is likely to have been paid by an unavoidable system call.

This is a tradeoff between disk seek time, disk read time, processor
time, and memory, and the optimal point varies depending on how many
files one is comparing. The MD5 approach burns processor time in an
attempt to save disk seek times and disk read times so that clock time
can be optimized. If you're optimizing for processor time, the hashing
approach is obviously not the way to go; if you're optimizing for clock
time or disk access, the question is a lot less cut and dried than you
seem to think.
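
A sketch of the caching idea from point three (an illustration only; the cache file location, Storable format, and helper name are arbitrary): store a digest per path and trust it only while the file's size and mtime are unchanged.

use strict;
use warnings;
use Digest::MD5;
use Storable qw(retrieve nstore);

my $cache_file = "$ENV{HOME}/.dupcheck_md5_cache";   # arbitrary location
my $cache = -e $cache_file ? retrieve($cache_file) : {};

sub cached_md5 {
    my ($path) = @_;
    my ($size, $mtime) = (stat $path)[7, 9];
    defined $size or die "Can't stat '$path': $!";

    # Reuse the stored digest only while the file looks unchanged.
    my $entry = $cache->{$path};
    return $entry->{md5}
        if $entry && $entry->{size} == $size && $entry->{mtime} == $mtime;

    open my $fh, '<', $path or die "Can't open '$path': $!";
    binmode $fh;
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    $cache->{$path} = { size => $size, mtime => $mtime, md5 => $md5 };
    return $md5;
}

# Persist the (possibly updated) cache when the program exits.
END { nstore($cache, $cache_file) if $cache }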

Charlton
 

Charlton Wilbur

RW> You asserted that you are convinced that a certain program
RW> wouldn't suffer from a certain problem. The question was "Why
RW> doesn't the perl 2-argument open work with filenames containing
RW> leading whitespace?". Even if your beliefs about this program
RW> happen to be correct, voicing them doesn't answer the question.

It doesn't answer the question, but it solves the problem.

The OP asked his question because he wanted to compare a large number of
files for equality, and was writing a Perl script to do so.

Which do you think is more helpful -- "here's a way to get Perl to get
around the odd edge case you've encountered," or "here's an easily
available open source program that solves your larger problem"?

Charlton
 

Rainer Weikusat

alexd said:
It's not about being helpful, it's about who can "win" the argument.

In this case, it was about answering a question someone asked which
happened to be related to perl. Whether that someone should perhaps have
asked a different question in another newsgroup is for him to decide.
 

Charlton Wilbur

RW> In this case, it was about answering a question someone asked
RW> which happened to be related to perl. Whether that someone should
RW> perhaps have asked a different question in another newsgroup is
RW> for him to decide.

In other words, it really isn't about being helpful, as far as you're
concerned.

Charlton
 
