Module to match file names against a wildcard spec?

Henry Law · Jun 15, 2005

I've searched CPAN and the web for an answer to this without finding
anything (but I confess I found it hard to structure a query so I may
have missed something). Maybe someone can point me in the right
direction.

A Perl program I'm writing reads in file names from a specified
directory and then processes them (how doesn't matter). If
subdirectories are found they are processed recursively.

I need to be able to restrict its operation by specifying groups of
files via a wild card; the control syntax looks a bit like this

# Include contents of C:\foo and all its subdirectories
include C:\foo
# But don't do text files in the root of \foo
exclude C:\foo\*.txt
# Note that text files elsewhere in the \foo tree, such as
# C:\foo\bar\bletch.txt should be processed.

(The above is Windows, obviously, but I need to write this so it works
on Unix too).

I'm getting really tangled up trying to turn my exclude specifications
into regexes which I can then use to exclude relevant files; not only
am I not a very experienced Perl coder but the logic of the task turns
out to be quite complicated. For example

#! /usr/bin/perl
use strict;
use warnings;
my @names =
    ("F:/NOTES/DATA/ABBC2.NSF", "F:/NOTES/DATA/FURBLE/ABBC2.NSF");
my $excl_spec = "F:/NOTES/DATA/.*\.NSF";

for (@names) {
    if (/$excl_spec/i) {
        print "$_ matches\n";
    } else {
        print "$_ doesn't match\n";
    }
}

When run this gives
F:/NOTES/DATA/ABBC2.NSF matches
F:/NOTES/DATA/FURBLE/ABBC2.NSF matches

.... which isn't what I want because the .* eats up the additional
subdirectory FURBLE as well.

I've looked at File::Spec and File::CheckTree without finding what I
want. Can anyone suggest either (1) A module that would help with
wildcard processing of file names, or (2) A better way of coding this
kind of thing?

Christopher Nehren · Jun 16, 2005

I've searched CPAN and the web for an answer to this without finding
anything (but I confess I found it hard to structure a query so I may
have missed something). Maybe someone can point me in the right
direction.

A Perl program I'm writing reads in file names from a specified
directory and then processes them (how doesn't matter). If
subdirectories are found they are processed recursively.
[...]
I've looked at File::Spec and File::CheckTree without finding what I
want. Can anyone suggest either (1) A module that would help with
wildcard processing of file names, or (2) A better way of coding this
kind of thing?

File::Find perhaps? Maybe File::Find::Rule if you find File::Find too
difficult (though I can't imagine why; I've found it to be delightfully
easy-to-use)?

Best Regards,
Christopher Nehren

Henry Law · Jun 16, 2005

On 2005-06-15, Henry Law scribbled these
curious markings:

A Perl program I'm writing reads in file names from a specified
directory and then processes them (how doesn't matter). If
subdirectories are found they are processed recursively.
[...]
I've looked at File::Spec and File::CheckTree without finding what I
want. Can anyone suggest either (1) A module that would help with
wildcard processing of file names, or (2) A better way of coding this
kind of thing?


File::Find perhaps?

Yes, I'm familiar with File::Find and there are two reasons why I'm
not using it. Firstly I can't prevent it from scanning sub-trees

include C:\some\path
exclude C:\some\path\huge\subdirectory

and secondly when it invokes its "wanted" function I'm still faced
with working out whether $File::Find::name matches my "exclude"
specification, which is the bit I'm stuck on. My tree-following logic
probably isn't perfect but it works well enough.

Chris · Jun 16, 2005

Henry said:
On 2005-06-15, Henry Law scribbled these
curious markings:

A Perl program I'm writing reads in file names from a specified
directory and then processes them (how doesn't matter). If
subdirectories are found they are processed recursively.
[...]
I've looked at File::Spec and File::CheckTree without finding what I
want. Can anyone suggest either (1) A module that would help with
wildcard processing of file names, or (2) A better way of coding this
kind of thing?


File::Find perhaps?


Yes, I'm familiar with File::Find and there are two reasons why I'm
not using it. Firstly I can't prevent it from scanning sub-trees

include C:\some\path
exclude C:\some\path\huge\subdirectory

File::Find has a "preprocess" option that may help here. From the
documentation:

"The value should be a code reference. This code reference is used to
preprocess the current directory. <snip> Your preprocessing function
is called after readdir(), but before the loop that calls the wanted()
function. <snip> The code can be used to sort the file/directory names
alphabetically, numerically, or to filter out directory entries based on
their name alone."

and secondly when it invokes its "wanted" function I'm still faced
with working out whether $File::Find::name matches my "exclude"
specification, which is the bit I'm stuck on. My tree-following logic
probably isn't perfect but it works well enough.

It looks like the problem is in your regex. You put:

my $excl_spec = "F:/NOTES/DATA/.*\.NSF";

which matched "F:/NOTES/DATA/ABBC2.NSF" and
"F:/NOTES/DATA/FURBLE/ABBC2.NSF". The reason it matched the second
string is because of the ".*" - you probably don't want to match ANY
character. For instance, you probably don't want to match the directory
separator "/". Try it with:

my $excl_spec = "F:/NOTES/DATA/[^/]*\.NSF";

That should only match .NSF files in that directory.
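
Two caveats with the pattern above, offered as a sketch rather than a definitive fix. Inside a double-quoted string Perl turns "\." into a plain ".", which then matches any character, and without anchors the pattern may match in the middle of a longer name. Building the pattern with qr// and anchoring both ends avoids both. The third file name in the list is made up for illustration.

```perl
use strict;
use warnings;

# qr// keeps the backslash, so \. really means a literal dot.
# ^ and \z anchor the match to the whole name; [^/]* cannot cross a "/".
my $excl_re = qr{^F:/NOTES/DATA/[^/]*\.NSF\z}i;

my @names = (
    "F:/NOTES/DATA/ABBC2.NSF",
    "F:/NOTES/DATA/FURBLE/ABBC2.NSF",
    "F:/NOTES/DATA/ABBC2_NSF",    # hypothetical name, no dot before NSF
);

for my $name (@names) {
    printf "%s %s\n", $name, $name =~ $excl_re ? "matches" : "doesn't match";
}
```

Only the first name should be reported as matching.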

Hope that helps

-chris

Henry Law · Jun 16, 2005

Chris wrote:

File::Find has a "preprocess" option that may help here. From the
documentation:

Ah, I'd looked at this but not in the right way; I see what you mean.
If I want to exclude certain subdirectories from being processed I can
drop them out of the list that the "preprocess" subroutine returns.
Neat; thank you.
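
A minimal sketch of that idea, assuming the excluded directories are kept as full paths in a hash. The %skip_dir hash and the temporary directory layout are hypothetical, built only so the example can run anywhere; paths use forward slashes, which is also how File::Find joins names.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a small throwaway tree so the sketch can run anywhere.
my $top = tempdir( CLEANUP => 1 );
mkdir "$top/huge" or die "mkdir: $!";
mkdir "$top/keep" or die "mkdir: $!";
for my $file ( "$top/a.txt", "$top/huge/b.txt", "$top/keep/c.txt" ) {
    open my $fh, '>', $file or die "open $file: $!";
    close $fh;
}

# Hypothetical exclude list: full paths of directories to skip entirely.
my %skip_dir = ( "$top/huge" => 1 );

my @seen;
find(
    {
        # Called once per directory with the names readdir() returned;
        # whatever it returns is what find() goes on to visit.
        preprocess => sub {
            grep { !$skip_dir{"$File::Find::dir/$_"} } @_;
        },
        # Collect plain files only; directories are just traversed.
        wanted => sub {
            push @seen, $File::Find::name if -f $_;
        },
    },
    $top
);

print "$_\n" for sort @seen;
```

Because "huge" is dropped from the list before find() descends, nothing beneath it is ever read.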

It looks like the problem is in your regex. You put:

my $excl_spec = "F:/NOTES/DATA/.*\.NSF";

which matched "F:/NOTES/DATA/ABBC2.NSF" and
"F:/NOTES/DATA/FURBLE/ABBC2.NSF". The reason it matched the second
string is because of the ".*" - you probably don't want to match ANY
character. For instance, you probably don't want to match the directory
separator "/". Try it with:

my $excl_spec = "F:/NOTES/DATA/[^/]*\.NSF";

Yes, I can see that now. I'm re-casting the exclusion checking to
split the checked file name and the exclude specification into their
component parts (F:, NOTES, DATA, .*\.NSF) and then I can immediately
tell if the two aren't at the same depth in the tree, before doing
regex-type matches on the respective parts. I've not shot all the
bugs yet but it looks promising.

Thanks for all the help.

Anno Siegel · Jun 16, 2005

Henry Law said:
I've searched CPAN and the web for an answer to this without finding
anything (but I confess I found it hard to structure a query so I may
have missed something). Maybe someone can point me in the right
direction.

A Perl program I'm writing reads in file names from a specified
directory and then processes them (how doesn't matter). If
subdirectories are found they are processed recursively.

I need to be able to restrict its operation by specifying groups of
files via a wild card; the control syntax looks a bit like this

# Include contents of C:\foo and all its subdirectories
include C:\foo
# But don't do text files in the root of \foo
exclude C:\foo\*.txt
# Note that text files elsewhere in the \foo tree, such as
# C:\foo\bar\bletch.txt should be processed.

(The above is Windows, obviously, but I need to write this so it works
on Unix too).

I'm getting really tangled up trying to turn my exclude specifications
into regexes which I can then use to exclude relevant files; not only
am I not a very experienced Perl coder but the logic of the task turns
out to be quite complicated. For example

Do you actually need that?

A glob-to-regex translator wouldn't be very hard to write (I think).
The hard part is getting the specification right for all kinds of
file system with so many variants of glob around. That is probably
why there isn't one in Regexp::Common, where it would belong.
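
For illustration only, here is a rough sketch of such a translator. The rules are an assumption made up for this example (only "*" and "?", neither of which crosses a "/"), not those of any particular shell, and the name wildcard_to_regex is hypothetical. It takes forward-slash paths and matches case-sensitively.

```perl
use strict;
use warnings;

# Translate a simple wildcard spec into an anchored regex.
# Assumed rules (hypothetical, not any particular shell's):
#   *  matches any run of characters except "/"
#   ?  matches exactly one character except "/"
#   everything else matches itself literally
sub wildcard_to_regex {
    my ($spec) = @_;
    my $re = '';
    for my $char ( split //, $spec ) {
        $re .= $char eq '*' ? '[^/]*'
             : $char eq '?' ? '[^/]'
             :                quotemeta($char);
    }
    return qr/^$re\z/;
}

my $re = wildcard_to_regex('C:/foo/*.txt');
for my $name ( 'C:/foo/notes.txt', 'C:/foo/bar/bletch.txt' ) {
    print "$name ", ( $name =~ $re ? "excluded" : "kept" ), "\n";
}
```

Backslash separators, character classes and case-insensitive file systems are left out, which is exactly the kind of variation mentioned above.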

However, since everything you want to include or exclude are actual
files in a file system (right?), you don't have to do that, you can
use Perl's glob() function together with File::Find. Here's a sketch:

use File::Find;

my $dir = 'c:\foo';
my $exclude = 'C:\foo\*.txt';
@exclude{ glob( $exclude)} = ();

find sub {
    return if exists $exclude{ $File::Find::name};
    # process file
}, $dir;

Anno

Anno Siegel · Jun 16, 2005

Henry Law said:
I've searched CPAN and the web for an answer to this without finding
anything (but I confess I found it hard to structure a query so I may
have missed something). Maybe someone can point me in the right
direction.

A Perl program I'm writing reads in file names from a specified
directory and then processes them (how doesn't matter). If
subdirectories are found they are processed recursively.

I need to be able to restrict its operation by specifying groups of
files via a wild card; the control syntax looks a bit like this

# Include contents of C:\foo and all its subdirectories
include C:\foo
# But don't do text files in the root of \foo
exclude C:\foo\*.txt
# Note that text files elsewhere in the \foo tree, such as
# C:\foo\bar\bletch.txt should be processed.

(The above is Windows, obviously, but I need to write this so it works
on Unix too).

I'm getting really tangled up trying to turn my exclude specifications
into regexes which I can then use to exclude relevant files; not only
am I not a very experienced Perl coder but the logic of the task turns
out to be quite complicated. For example

Do you actually need that?

A glob-to-regex translator wouldn't be very hard to write (I think).
The hard part is getting the specification right for all kinds of
file system with so many variants of glob around. That is probably
why there isn't one in Regexp::Common, where it would belong.

However, since everything you want to include or exclude are actual
files in a file system (right?), you don't have to do that, you can
use Perl's glob() function together with File::Find. Here's a sketch:

use File::Find;

my $dir = 'c:\foo';
my $exclude = 'C:\foo\*.txt';
my %exclude;
@exclude{ glob( $exclude)} = ();

find sub {
    return if exists $exclude{ $File::Find::name};
    # process file
}, $dir;

Anno
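
A runnable variant of the sketch above, for anyone who wants to try it. The temporary directory and the file names in it are made up for the demonstration. It uses forward slashes throughout so that the names glob() returns are spelled the same way as $File::Find::name; with mixed separators or differing case the hash lookup would probably miss. Note also that Perl's built-in glob() splits its argument on whitespace, so a spec containing spaces would need extra care.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Throwaway tree standing in for C:\foo (layout is made up for the demo).
my $dir = tempdir( CLEANUP => 1 );
mkdir "$dir/bar" or die "mkdir: $!";
for my $file ( "$dir/top.txt", "$dir/top.dat", "$dir/bar/bletch.txt" ) {
    open my $fh, '>', $file or die "open $file: $!";
    close $fh;
}

# Let glob() expand the exclude spec into real file names, then use
# them as hash keys so the lookup in the callback is a simple exists().
my %exclude;
@exclude{ glob("$dir/*.txt") } = ();

my @processed;
find(
    sub {
        return unless -f $_;                             # skip directories
        return if exists $exclude{$File::Find::name};    # skip excluded files
        push @processed, $File::Find::name;
    },
    $dir
);

print "$_\n" for sort @processed;
```

The text file in the root is skipped, while top.dat and bar/bletch.txt are still processed.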

Henry Law · Jun 16, 2005

However, since everything you want to include or exclude are actual
files in a file system (right?),

Indeed. Just files or complete sub-directories.

you don't have to do that, you can
use Perl's glob() function together with File::Find. Here's a sketch:

use File::Find;

my $dir = 'c:\foo';
my $exclude = 'C:\foo\*.txt';
@exclude{ glob( $exclude)} = ();

Uuh ... this is where I feel like the sorcerer's apprentice: totally
out of my depth. There are Perl constructs here that I simply don't
recognise:

@exclude{ glob( $exclude)} = ();
^ ^ ^
| | |
| 1. why don't we have to declare "@exclude" with "my"?
| |
| 2. That looks like a hash but "@exclude" is an array; or
| is this some kind of subroutine? Surely not ...
|
| 3. Empty list ... but why, and where
| is it going to?

If you could help me by pointing out the perldoc references where this
seam of witchcraft is described I'll go and read up!

The rest of your post - the part dealing with File::Find - I do
understand. Thanks in the mean time.

A. Sinan Unur · Jun 16, 2005

On 16 Jun 2005 09:48:30 GMT, Anno Siegel wrote:

Uuh ... this is where I feel like the sorcerer's apprentice: totally
out of my depth. There are Perl constructs here that I simply don't
recognise:

@exclude{ glob( $exclude)} = ();
^ ^ ^
| | |
| 1. why don't we have to declare "@exclude" with "my"?
| |
| 2. That looks like a hash but "@exclude" is an array; or
| is this some kind of subroutine? Surely not ...
|
| 3. Empty list ... but why, and where
| is it going to?

If you could help me by pointing out the perldoc references where this
seam of witchcraft is described I'll go and read up!

perldoc perldata

Read the section on slices.

If Anno had specified

use strict;

he would have had to have:

my %exclude;

before the assignment to the slice.
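
A small sketch of what that slice assignment does, with made-up file names as keys:

```perl
use strict;
use warnings;

my %exclude;    # the hash must be declared under strict

# A hash slice: the leading @ means "several elements at once", the
# braces mean the container is the hash %exclude (no array is involved).
# Assigning the empty list creates each key with an undefined value.
@exclude{ 'a.txt', 'b.txt' } = ();

# The same thing written out one element at a time:
# $exclude{'a.txt'} = undef;
# $exclude{'b.txt'} = undef;

for my $name ( 'a.txt', 'c.txt' ) {
    print "$name: ", ( exists $exclude{$name} ? "excluded" : "not excluded" ), "\n";
}
```

Only the keys matter here, which is why the lookup uses exists rather than testing the (undefined) values.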

Sinan

A. Sinan Unur · Jun 16, 2005

If Anno had specified

use strict;

And so he did, in <[email protected]>.

Sorry, I had not seen that one.

Sinan

Anno Siegel · Jun 16, 2005

A. Sinan Unur said:
And so he did, in <[email protected]>.

Sorry, I had not seen that one.

Yes, the declaration was meant to be there, it got lost in a copy/paste
operation. I corrected that fast, but not fast enough for modern Usenet,
it seems.

It used to be I could safely send a "supersede" within 5 minutes or so
and still catch it on my server most of the time. These days, more
uncorrected postings seem to escape; something must be spinning faster.
I'll adjust my discipline and add a note to superseding postings that
marks them as such.

Anno

Ilmari Karonen · Jun 17, 2005

Anno Siegel said:
A glob-to-regex translator wouldn't be very hard to write (I think).

It seems this has, in fact, been done already.

http://search.cpan.org/dist/Text-Glob/

Anno Siegel · Jun 18, 2005

Ilmari Karonen said:
It seems this has, in fact, been done already.

http://search.cpan.org/dist/Text-Glob/

Ah, yes. I haven't run it, but the doc looks good.

Anno
