Trouble with embedded whitespace in filenames using File::Find

Discussion in 'Perl Misc' started by Clint O, Jan 21, 2013.

  1. Clint O

    Clint O Guest

    The following program is one I wrote to find duplicate files. The problem is that I have files with whitespace or potentially other special characters in their names:

    #!/opt/local/bin/perl

    use Digest::MD5;
    use File::Find;
    use Data::Dumper;

    use strict;
    use warnings;

    my %results = ();

    sub do_file;

    my @files = @ARGV;

    exit 1 if !@files;

    find(sub { do_file(\%results) }, @files);

    for (keys %results) {
        my @f = @{$results{$_}};

        if (scalar @f > 1) {
            print "$f[0] => $f[1]\n";
        }
    }

    sub do_file {
        my ($hash) = @_;
        return if -d $_;

        open(my $fh, $_) or die "Can't open '$File::Find::name': $!";
        binmode $fh;

        my $digest;

        $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;

        push @{$hash->{$digest}}, $File::Find::name;
    }

    0;

    If I create a test directory:

    $ mkdir test_dir
    $ cd test_dir
    $ touch " my file"
    $ ./dupcheck testdir
    Can't open 'testdir/ my file': No such file or directory at ./dupcheck line 32.

    I can't be the first one who has run into this problem, and I'm sure there's a reasonable explanation for how to cope with this, but I haven't been able to find anything via the searching etc. on the web.

    Thanks,

    -Clint
     
    Clint O, Jan 21, 2013
    #1

  2. Clint O

    Clint O Guest

    On Monday, January 21, 2013 1:15:19 PM UTC-8, Henry Law wrote:
    > > Can't open 'testdir/ my file': No such file or directory at ./dupcheck line 32.
    >
    > You created
    >
    >     "test_dir/my file"
    >          ^
    >
    > and you're trying to open
    >
    >     "testdir/ my file".
    >              ^
    >
    > It's not there, so the program complains.


    Well, that "test_dir" is clearly a typo. This program would never have generated this output with a non-existent directory:

    $ ./dupcheck /asfasfasdfasdf
    Can't stat /asfasfasdfasdf: No such file or directory
    at ./dupcheck line 18

    Anyway, my issue still stands. I cannot open a local file with embedded whitespace.

    Thanks,

    -Clint
     
    Clint O, Jan 21, 2013
    #2

  3. Clint O <> writes:

    [...]

    > The following program I wrote I'm using to find duplicate files. The
    > problem is that I have files with whitespace or potentially other
    > special characters:


    [...]

    > open(my $fh, $_) or die "Can't open '$File::Find::name': $!";


    Since you didn't specify an explicit open mode, perl parses $_ in
    order to look for one and it skips leading whitespace, cf.

        The filename passed to 2-argument (or 1-argument) form of
        open() will have leading and trailing whitespace deleted, and
        the normal redirection characters honored.
        [perldoc -f open]

    Using open($fh, '<', $_) instead works.

    BTW: Assuming you're running this as root, someone who doesn't like
    you could create a file named |rm -rf `printf "\x2f"` and you probably
    wouldn't like the result of trying to open that.

    NB: DO NOT TRY THIS. Unless I made an error, this will execute rm
    -rf / with the privileges of the invoker.

    More harmless: td/|ls `printf "..\x2f"`. This will list the contents
    of the directory above td.
     
    Rainer Weikusat, Jan 21, 2013
    #3
  4. Clint O

    Clint O Guest

    On Monday, January 21, 2013 1:24:28 PM UTC-8, Rainer Weikusat wrote:
    > Since you didn't specify an explicit open mode, perl parses $_ in
    > order to look for one and it skips leading whitespace, cf.
    >
    >     The filename passed to 2-argument (or 1-argument) form of
    >     open() will have leading and trailing whitespace deleted, and
    >     the normal redirection characters honored.
    >     [perldoc -f open]
    >
    > Using open($fh, '<', $_) instead works.
    >
    > BTW: Assuming you're running this as root, someone who doesn't like
    > you could create a file named |rm -rf `printf "\x2f"` and you probably
    > wouldn't like the result of trying to open that.
    >
    > NB: DO NOT TRY THIS. Unless I made an error, this will execute rm
    > -rf / with the privileges of the invoker.
    >
    > More harmless: td/|ls `printf "..\x2f"`. This will list the contents
    > of the directory above td.


    Ok, thanks for the tip and the heads-up. I am running the program as root on a NAS, and the files are created by my family, but just as a good FYI, are there ways I can protect myself against malicious code? Running as root ensures I can read all the files w/o question. I've used Safe before, but I'm not sure whether it's necessary or appropriate for this application.

    Thanks,

    -Clint
     
    Clint O, Jan 21, 2013
    #4
  5. Clint O <> wrote:
    >On Monday, January 21, 2013 1:15:19 PM UTC-8, Henry Law wrote:
    >>
    >> > Can't open 'testdir/ my file': No such file or directory at ./dupcheck line 32.

    >>
    >> You created
    >> "test_dir/my file"
    >> ^
    >> and you're trying to open
    >> "testdir/ my file".
    >> ^
    >>
    >> It's not there, so the program complains.

    >
    >Well, that "test_dir" is clearly a typo.


    So, you should be thankful that Henry found that typo and pointed it out
    to you, right?

    >Anyway, my issue still stands. I cannot open a local file with embedded whitespace.


    Well, nobody claimed that there is only one issue in your program.

    jue
     
    Jürgen Exner, Jan 21, 2013
    #5
  6. Clint O

    Clint O Guest

    On Monday, January 21, 2013 2:21:26 PM UTC-8, Jürgen Exner wrote:
    > Clint O wrote:
    > >On Monday, January 21, 2013 1:15:19 PM UTC-8, Henry Law wrote:
    > >> > Can't open 'testdir/ my file': No such file or directory at ./dupcheck line 32.
    > >>
    > >> You created
    > >> "test_dir/my file"
    > >> and you're trying to open
    > >> "testdir/ my file".
    > >> It's not there, so the program complains.
    > >
    > >Well, that "test_dir" is clearly a typo.
    >
    > So, you should be thankful that Henry found that typo and pointed it out
    > to you, right?
    >
    > >Anyway, my issue still stands. I cannot open a local file with embedded whitespace.
    >
    > Well, nobody claimed that there is only one issue in your program.


    Well, if you're going to critique my program and bother to post a reply, at least make it relevant. People request that you post entire scripts so that the problem can be seen by others. I did due diligence by posting the script and made a mistake in the test case.

    -Clint
     
    Clint O, Jan 21, 2013
    #6
  7. Clint O <> wrote:
    [Fullquote to prove my point]
    >On Monday, January 21, 2013 2:21:26 PM UTC-8, Jürgen Exner wrote:
    >> Clint O wrote:
    >>
    >> >On Monday, January 21, 2013 1:15:19 PM UTC-8, Henry Law wrote:

    >>
    >> >>

    >>
    >> >> > Can't open 'testdir/ my file': No such file or directory at ./dupcheck line 32.

    >>
    >> >>

    >>
    >> >> You created

    >>
    >> >> "test_dir/my file"

    >>
    >> >> ^

    >>
    >> >> and you're trying to open

    >>
    >> >> "testdir/ my file".

    >>
    >> >> ^

    >>
    >> >>

    >>
    >> >> It's not there, so the program complains.

    >>
    >> >

    >>
    >> >Well, that "test_dir" is clearly a typo.

    >>
    >>
    >>
    >> So, you should be thankful that Henry found that typo and pointed it out
    >>
    >> to you, right?
    >>
    >>
    >>
    >> >Anyway, my issue still stands. I cannot open a local file with embedded whitespace.

    >>
    >>
    >>
    >> Well, nobody claimed that there is only one issue in your program.

    >
    >Well, if you're going to critique my program and bother to post a reply, at least make it relevant. People request that you post entire scripts so that the problem can be seen by others. I did due diligence by posting the script and made a mistake in the testcase.


    Ok, because you explicitly asked for it:
    - Is there a specific reason why you are adding an empty line after
    every line you quote? That doesn't improve readability one bit and makes
    quoting your post rather tedious.
    - Is there a specific reason why your lines are longer than the usual
    70-75 characters?

    jue
     
    Jürgen Exner, Jan 21, 2013
    #7
  8. Clint O

    Clint O Guest

    On Monday, January 21, 2013 3:08:35 PM UTC-8, Jürgen Exner wrote:
    > Ok, because you explicitly asked for it:
    > - Is there a specific reason why you are adding an empty line after
    > every line you quote? That doesn't improve readability one bit and makes
    > quoting your post rather tedious.
    > - Is there a specific reason why your lines are longer than the usual
    > 70-75 characters?


    I'm guessing these might be artifacts of the Google Groups web interface. That's what I'm using to read the group. It's hard(er) to control the formatting of my responses. Coming from a hard-nosed slrn background, I agree that it is annoying, and if I can figure it out I will fix it.

    Thanks,

    -Clint
     
    Clint O, Jan 21, 2013
    #8
  9. Ben Morrow <> writes:
    > Quoth Clint O <>:
    >> On Monday, January 21, 2013 1:24:28 PM UTC-8, Rainer Weikusat wrote:
    >> > BTW: Assuming you're running this as root, someone who doesn't like
    >> > you could create a file named |rm -rf `printf "\x2f"` and you probably
    >> > wouldn't like the result of trying to open that.
    >>
    >> Ok, thanks for the tip and the heads-up. I am running the program as
    >> root on a NAS, and the files are created by my family, but just as a
    >> good FYI, are there ways I can protect myself against malicious code?
    >> Running as root ensures I can read all the files w/o question.


    [...]

    > If you must do this as root, I would seriously consider using find(1),
    > xargs(1) and md5(1) instead, assuming your find and xargs support the
    > -print0 and -0 arguments. You're much less likely to make a serious
    > mistake using preexisting utilities than trying to write your own.


    Sorry to be so blunt but this is a really stupid suggestion: not only
    are a lot of characters which are valid in filenames syntactically
    significant to the shell, the shell will also perform multiple passes
    of textual substitution on a complete input line and happily execute
    whatever the combined result happens to be. IOW, the shell does not
    genuinely distinguish between 'script text from a file' and 'text
    produced as the result of an operation performed by the script',
    making it an extremely poor choice for writing code supposed to run
    in a hostile environment. perl is much better in this respect: not
    only does it refrain from executing data 'by default' (it does so
    only when explicitly asked to), it can also be made to complain about
    a lot of potentially unsafe 'data flows'; see 'Taint mode' in
    perlsec. These checks can be onerous at times but they should catch a
    lot of accidental errors (such as the 2-arg open of a string which
    came from the file system).
     
    Rainer Weikusat, Jan 22, 2013
    #9
  10. Mike Scott

    Mike Scott Guest

    On 21/01/13 21:39, Clint O wrote:
    .....
    >
    > Ok, thanks for the tip and the heads-up. I am running the program as
    > root on a NAS, and the files are created by my family, but just as a
    > good FYI, are there ways I can protect myself against malicious code?
    > Running as root ensures I can read all the files w/o question. I've
    > used Safe before, but I'm not sure whether it's necessary or
    > appropriate for this application.
    >


    If I may ask a naive question.... Why are you writing a duplicate-file
    finder from scratch when programs such as fdupes already exist and
    presumably have such issues already resolved?

    fdupes "searches the given path for duplicate files. Such files are
    found by comparing file sizes and MD5 signatures, followed by a
    byte-by-byte comparison". That last bit is important.

    --
    Mike Scott (unet2 <at> [deletethis] scottsonline.org.uk)
    Harlow Essex England
     
    Mike Scott, Jan 23, 2013
    #10
  11. Mike Scott <> writes:
    > On 21/01/13 21:39, Clint O wrote:
    > ....
    >>
    >> Ok, thanks for the tip and the heads-up. I am running the program as
    >> root on a NAS, and the files are created by my family, but just as a
    >> good FYI, are there ways I can protect myself against malicious code?
    >> Running as root ensures I can read all the files w/o question. I've
    >> used Safe before, but I'm not sure whether it's necessary or
    >> appropriate for this application.

    >
    > If I may ask a naive question.... Why are you writing a
    > duplicate-file finder from scratch when programs such as fdupes
    > already exist and presumably have such issues already resolved?


    May I ask you an equally naive question? Why precisely do you think
    your statement is even remotely on topic for a Perl newsgroup?

    > fdupes "searches the given path for duplicate files. Such files are
    > found by comparing file sizes and MD5 signatures, followed by a
    > byte-by-byte comparison". That last bit is important.


    Indeed. It communicates that the author didn't really think straight:
    Calculating an MD5 hash of a file requires an expensive processing
    operation to be performed for each byte of this file. OTOH, comparing
    the content of files of identical sizes (which should already be quite
    rare) with each other will usually stop early if the files are not
    identical.
     
    Rainer Weikusat, Jan 23, 2013
    #11
  12. Mike Scott

    Mike Scott Guest

    On 23/01/13 13:21, Rainer Weikusat wrote:
    > Mike Scott <> writes:

    ....
    >> If I may ask a naive question.... Why are you writing a
    >> duplicate-file finder from scratch when programs such as fdupes
    >> already exist and presumably have such issues already resolved?

    >
    > May I ask you an equally naive question? Why precisely do you think
    > your statement is even remotely on topic for a Perl newsgroup?


    Because it's answering the issue implicit in the original post, and
    may save the OP considerable effort and pain. I entirely agree a
    discussion of fdupes itself would be out of place here.

    --
    Mike Scott (unet2 <at> [deletethis] scottsonline.org.uk)
    Harlow Essex England
     
    Mike Scott, Jan 23, 2013
    #12
  13. Mike Scott <> writes:
    > On 23/01/13 13:21, Rainer Weikusat wrote:
    >> Mike Scott <> writes:

    > ...
    >>> If I may ask a naive question.... Why are you writing a
    >>> duplicate-file finder from scratch when programs such as fdupes
    >>> already exist and presumably have such issues already resolved?

    >>
    >> May I ask you an equally naive question? Why precisely do you think
    >> your statement is even remotely on topic for a Perl newsgroup?

    >
    > Because it's answering the issue implicit in the original post,


    You asserted that you are convinced that a certain program wouldn't
    suffer from a certain problem. The question was "Why doesn't
    the perl 2-argument open work with filenames containing leading
    whitespace?". Even if your beliefs about this program happen to be
    correct, voicing them doesn't answer the question.
     
    Rainer Weikusat, Jan 23, 2013
    #13
  14. Jim Gibson

    Jim Gibson Guest

    In article <>, Rainer
    Weikusat <> wrote:

    > Indeed. It communicates that the author didn't really think straight:
    > Calculating a MD5 hash of a file requires an expensive processing
    > operation to be performed for each byte of this file. OTOH, comparing
    > the content of files of identical sizes (which should already be quite
    > rare) with each other will usually stop early if the files are not
    > identical.


    True enough, but if you have N files and are looking for duplicates
    among any pair, it is probably more efficient to compute a checksum for
    each of the files, then look for duplicates among the checksums. If the
    files are large enough, comparing checksums will be faster than
    comparing the files themselves.

    --
    Jim Gibson
     
    Jim Gibson, Jan 23, 2013
    #14
  15. Willem

    Willem Guest

    Jim Gibson wrote:
    ) In article <>, Rainer
    ) Weikusat <> wrote:
    )
    )> Indeed. It communicates that the author didn't really think straight:
    )> Calculating a MD5 hash of a file requires an expensive processing
    )> operation to be performed for each byte of this file. OTOH, comparing
    )> the content of files of identical sizes (which should already be quite
    )> rare) with each other will usually stop early if the files are not
    )> identical.
    )
    ) True enough, but if you have N files and are looking for duplicates
    ) among any pair, it is probably more efficient to compute a checksum for
    ) each of the files, then look for duplicates among the checksums. If the
    ) files are large enough, comparing checksums will be faster than
    ) comparing the files themselves.

    That depends. Even assuming the files are all the same size, it's
    quite probable that there will be differences in the first block.

    I think a good approach is to first group by file size, but read
    the first N bytes of each file as well and keep those in memory.
    (To take advantage of filesystems that store the first chunk of
    a file inside the inode).


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Jan 23, 2013
    #15
  16. Ben Morrow <> writes:
    > Quoth Jim Gibson <>:
    >> In article <>, Rainer
    >> Weikusat <> wrote:
    >>
    >> > Indeed. It communicates that the author didn't really think straight:
    >> > Calculating an MD5 hash of a file requires an expensive processing
    >
    > MD5 is not expensive. It probably takes less time to MD5 two files,
    > reading each sequentially, than it takes to read alternating blocks from
    > each file, with the associated disk seeks.


    MD5 (or any other hashing algorithm) is a lot more expensive than a
    comparison and especially so if MD5 needs to process 2G of data while
    the comparison would only need 8K. This means that MD5 will usually
    lose if the files are different. And MD5 + byte-by-byte comparison
    will usually lose if they aren't. Anything else is a pathological
    situation (eg, lots of large files differing in the last few bytes).
     
    Rainer Weikusat, Jan 23, 2013
    #16
  17. >>>>> "RW" == Rainer Weikusat <> writes:

    RW> MD5 (or any other hashing algorithm) is a lot more expensive
    RW> than a comparison and especially so if MD5 needs to process 2G
    RW> of data while the comparison would only need 8K.

    You make several unfounded assumptions here.

    One, that the cost of a single linear read of a file, such as needed to
    calculate a file hash, is comparable in expense and time to two or more
    interleaved file reads, such as needed to do a direct comparison. Since
    the seeking takes more time than the reading -- often by an order of
    magnitude -- and the reading takes more time than the calculation --
    again, often by an order of magnitude -- it is difficult to support this
    claim. Yes, in terms of raw processor time, calculating an MD5 hash on
    each of two blocks of memory and then comparing the result is more
    expensive than comparing the two blocks of memory, especially if the
    comparison can terminate at the first difference; but processor time is
    far from the only cost being paid, and in the average case where a
    filesystem is involved I expect the tradeoffs to be far less clear.

    Two, that the number of comparisons is small. The more comparisons you
    have, the more the advantage goes to the hashing algorithm. If you have
    2 files, it is best to read the first 8K of each and compare them,
    since, as you note, odds are that any differences will appear early on.
    If you have 1000 files, reading the first 8K of each file for
    comparison purposes means a great deal of seeking and reading; and then
    you either store the first 8K, leading to a large working set (and the
    first time you swap, you've lost anything you won by avoiding
    calculating hashes), or you repeatedly seek and read. MD5 hashes, at 16
    bytes each, require a much smaller working set.

    Three, that no other caching or optimization is possible. If this task
    is done repeatedly, it should be possible to cache the hash values of
    the files and compare a timestamp on the hash value to the timestamp on
    the file. If two files differ in size, they are clearly not equivalent;
    determining the size of the file may be basically free, since the cost
    for the call is likely to have been paid by an unavoidable system call.

    This is a tradeoff between disk seek time, disk read time, processor
    time, and memory, and the optimal point varies depending on how many
    files one is comparing. The MD5 approach burns processor time in an
    attempt to save disk seek times and disk read times so that clock time
    can be optimized. If you're optimizing for processor time, the hashing
    approach is obviously not the way to go; if you're optimizing for clock
    time or disk access, the question is a lot less cut and dried than you
    seem to think.

    Charlton





    --
    Charlton Wilbur
     
    Charlton Wilbur, Jan 24, 2013
    #17
  18. >>>>> "RW" == Rainer Weikusat <> writes:

    RW> You asserted that you are convinced that a certain program
    RW> wouldn't suffer from a certain problem. The question was "Why
    RW> doesn't the perl 2-argument open work with filenames containing
    RW> leading whitespace?". Even if your believes about this program
    RW> happen to be correct, voicing them doesn't answer the question.

    It doesn't answer the question, but it solves the problem.

    The OP asked his question because he wanted to compare a large number of
    files for equality, and was writing a Perl script to do so.

    Which do you think is more helpful -- "here's a way to get Perl to get
    around the odd edge case you've encountered," or "here's an easily
    available open source program that solves your larger problem"?

    Charlton



    --
    Charlton Wilbur
     
    Charlton Wilbur, Jan 24, 2013
    #18
  19. alexd <> writes:
    > Charlton Wilbur (for it is he) wrote:
    >
    >> Which do you think is more helpful

    >
    > It's not about being helpful, it's about who can "win" the argument.


    In this case, it was about answering a question someone asked which
    happened to be related to perl. Whether that someone should perhaps have
    asked a different question in another newsgroup is for him to decide.
     
    Rainer Weikusat, Jan 24, 2013
    #19
  20. >>>>> "RW" == Rainer Weikusat <> writes:

    RW> alexd <> writes:

    >> It's not about being helpful, it's about who can "win" the
    >> argument.


    RW> In this case, it was about answering a question someone asked
    RW> which happened to be related to perl. If that someone should
    RW> perhaps have asked a different question in another newsgroup is
    RW> for him to decide.

    In other words, it really isn't about being helpful, as far as you're
    concerned.

    Charlton




    --
    Charlton Wilbur
     
    Charlton Wilbur, Jan 25, 2013
    #20
