improvement suggestion for File::Find: pre-parsed extensions

Ted Zlatanov · Jul 3, 2007

I think it would be really useful if in addition to $File::Find::name
and $_ there were also $File::Find::namenoext, $_namenoext ($_ without
the extension), and $File::Find::ext (the extension itself). On most
OSs it should be true that

$File::Find::namenoext . '.' . $File::Find::ext eq $File::Find::name

$_namenoext . '.' . $File::Find::ext eq $_

but when the extension is undef, $_namenoext eq $_ (I think undef is
better than '' for an empty extension, and would distinguish nicely
between "filename." and "filename").
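
A minimal sketch of the split described above. The helper name split_ext is hypothetical (it is not part of File::Find), and the sketch assumes '/' as the directory separator and treats whatever follows the last dot of the final path component as the extension:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of File::Find: split a path into
# (name without extension, extension). The extension is undef when the
# last path component contains no dot, and '' when it ends in a dot.
sub split_ext {
    my ($path) = @_;
    # Excluding '/' from the extension keeps "dir.d/file" extension-free.
    if ($path =~ m{\A(.*)\.([^./]*)\z}s) {
        return ($1, $2);
    }
    return ($path, undef);
}

for my $path ('lib/Foo.pm', 'filename.', 'filename', 'dir.d/file') {
    my ($noext, $ext) = split_ext($path);
    printf "%-12s noext=%-10s ext=%s\n", $path, $noext,
        defined $ext ? "'$ext'" : 'undef';
}
```

Whenever the extension is defined, joining the two parts with '.' gives back the original name. Under this rule a dotfile such as .bashrc splits into an empty name and the extension bashrc, which may or may not be what a caller wants.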

I do this all the time by hand inside the wanted() function, and I think
an extra string match and three more variables won't hurt File::Find too
much. It's already IO-bound in the find() function.

The code for this is easy, but this is a core module so before producing
a patch I thought I'd ask: has it been done before, and did I miss
something in File::Find? I didn't find suggestions of this feature in
newsgroup or Google archives, but I'd like to hear if anyone has
suggestions for or against it.

Ted

Paul Lalli · Jul 3, 2007

> I think it would be really useful if in addition to $File::Find::name
> and $_ there were also $File::Find::namenoext, $_namenoext ($_ without
> the extension), and $File::Find::ext (the extension itself).

File::Basename makes those values trivially easy to obtain. I see no
reason for File::Find to create them for you, as they're not needed in
a majority of applications for which File::Find is used.
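
For readers who have not used it, this is roughly what the File::Basename approach looks like inside a wanted() callback. It is a sketch only: the suffix pattern qr/\.[^.]*/ is the one shown in the File::Basename documentation, and collect_suffixes is a hypothetical helper name.

```perl
use strict;
use warnings;
use File::Find;
use File::Basename qw(fileparse);

# Hypothetical helper: walk $dir and return a hash reference mapping
# each plain file's full name to its suffix ('' when there is none).
sub collect_suffixes {
    my ($dir) = @_;
    my %suffix_of;
    find(sub {
        return unless -f $_;
        # fileparse returns (basename, directory, suffix); the suffix
        # keeps its leading dot because the pattern matches it.
        my ($base, $path, $suffix) = fileparse($File::Find::name, qr/\.[^.]*/);
        $suffix_of{$File::Find::name} = $suffix;
    }, $dir);
    return \%suffix_of;
}

my $suffixes = collect_suffixes('.');
print "$_ => '$suffixes->{$_}'\n" for sort keys %$suffixes;
```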

Paul Lalli

Ted Zlatanov · Jul 3, 2007

PL> File::Basename makes those values trivially easy to obtain.

No, you still have to a) use the module, and b) call the functions.
That's not trivial compared to "the value is in $ext".

PL> I see no reason for File::Find to create them for you, as they're
PL> not needed in a majority of applications for which File::Find is
PL> used.

That hasn't been my experience, but then again I have written several
installers (and many other programs) using File::Find, so I may be
biased. In any case, as I mentioned, File::Find is IO-bound, so doing
the extra CPU work in the find() function would not harm performance,
and the extra memory usage is negligible. I don't think we should
sacrifice convenience for an unnecessary optimization.

Ted

Paul Lalli · Jul 3, 2007

> PL> File::Basename makes those values trivially easy to obtain.
>
> No, you still have to a) use the module, and b) call the functions.
> That's not trivial compared to "the value is in $ext".

We'll have to agree to disagree as to the definition of "trivial".

> and the extra memory usage is negligible. I don't think we should
> sacrifice convenience for an unnecessary optimization.

And I don't think we should unnecessarily bloat a module to duplicate
functionality already available elsewhere.

Paul Lalli

Michele Dondi · Jul 3, 2007

> I think it would be really useful if in addition to $File::Find::name
> and $_ there were also $File::Find::namenoext, $_namenoext ($_ without
> the extension), and $File::Find::ext (the extension itself). On most
> OSs it should be true that

How 'bout @File::Find::file? But what to do if no_chdir => 1, anyway?

> The code for this is easy, but this is a core module so before producing
> a patch I thought I'd ask: has it been done before, and did I miss
> something in File::Find? I didn't find suggestions of this feature in
> newsgroup or Google archives, but I'd like to hear if anyone has
> suggestions for or against it.

For me File::Basename is so near to my hand, and the kind of info you
mention rare enough to be a need of mine, that I don't see that as a
compellingly desirable feature. Perhaps a F::F on steroids with either
additional stuff passed to the wanted() sub or an object $_ with
suitable methods *or both* would be welcome. By suitable methods I
mean $_->name, $_->ext, $_->basename, $_->stat (so that you don't have
to do that again), etc. Of course this would be max fun in Perl 6 with
its unary dot:

find { .basename.say if .ext ~~ 'txt' }, $dir;

Michele

Michele Dondi · Jul 3, 2007

> PL> File::Basename makes those values trivially easy to obtain.
>
> No, you still have to a) use the module, and b) call the functions.
> That's not trivial compared to "the value is in $ext".

Yep, but in several years of F::F's usage I can't remember having had
to use it so much. It may just be that my experience differs from
yours.

> That hasn't been my experience, but then again I have written several
> installers (and many other programs) using File::Find, so I may be
> biased. In any case, as I mentioned, File::Find is IO-bound, so doing

To be fair, yes: I think you're biased.

> the extra CPU work in the find() function would not harm performance,
> and the extra memory usage is negligible. I don't think we should
> sacrifice convenience for an unnecessary optimization.

But we shouldn't sacrifice simplicity and orthogonality for YAGNI.

Michele

Ted Zlatanov · Jul 3, 2007

PL> We'll have to agree to disagree as to the definition of "trivial".

Obviously anything that requires another module is not trivial, unless
you've done it for so long you don't notice the annoyance.

PL> And I don't think we should unnecessarily bloat a module to duplicate
PL> functionality already available elsewhere.

Explain where's the bloat. And why do you assume the functionality is
duplicated? Access to a variable doesn't equate to duplication of effort.
The patch I had in mind would just use File::Basename internally. The
whole behaviour could be optional if you and others feel strongly, but
it's hardly duplication.

Ted

Ted Zlatanov · Jul 3, 2007

MD> For me File::Basename is so near to my hand and the kind info you
MD> mention rare enough to be a need of mine that I don't see that as a
MD> compellingly desirable feature. Perhaps a F::F on steroids with either
MD> additional stuff passed to the wanted() sub or an object $_ with
MD> suitable methods *or both* would be welcome. For suitable methods I
MD> mean $_->name, $_->ext, $_->basename, $_->stat (so that you don't have
MD> to do that again, etc.) Of course this would be max fun in Perl 6 with
MD> its unary dot:

MD> find { .basename.say if .ext ~~ 'txt' }, $dir;

Yes, that should really be an object. But that would be a major API
change compared to providing some extra info for the current file. I'm
sure Perl 6 will have something similar or better.

Ted

anno4000 · Jul 4, 2007

Ted Zlatanov said:
[...]

> biased. In any case, as I mentioned, File::Find is IO-bound, so doing
> the extra CPU work in the find() function would not harm performance,
> and the extra memory usage is negligible.

You said that before, but have you checked? A run of find() that
calls File::Basename::fileparse() on every file takes twice the time
of one that only visits the files. Doubling the overhead isn't
something I'd do lightly.

Anno

    use Benchmark qw( cmpthese );
    use File::Find;
    use File::Basename;

    cmpthese( -1, {
        # Baseline: visit every file and do nothing.
        without_base => sub {
            find sub {}, '.';
        },
        # Same traversal, plus one fileparse() call per file.
        with_base => sub {
            find sub {
                my ($name, $path, $suffix) = fileparse($File::Find::name);
            }, '.';
        },
    });
__END__

                   Rate    with_base without_base
    with_base    14.6/s           --         -51%
    without_base 29.9/s         105%           --

Ted Zlatanov · Jul 5, 2007

a> You said that before, but have you checked? A run of find() that
a> calls File::Basename::fileparse() on every file takes twice the time
a> of one that only visits the files. Doubling the overhead isn't
a> something I'd do lightly.

You're right, and I assumed incorrectly.

A simple parse function:

    my $suffix = undef;
    my $name = $File::Find::name;
    my $base = $name;
    if ($name =~ m/(.*)\.([^.]+)/)
    {
        $suffix = $2;
        $base = $1;
    }

is only 20% slower on a local filesystem, which is better but still not
good. So I would not enable this feature by default, but only when
requested (e.g. through File::Find::find_fullparse() or
$File::Find::fullparse=1 or the require parameters). Would that be
acceptable?
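
One caveat with the pattern above: because [^.]+ also matches '/', a dot in a directory name (for example ./conf.d/README) is taken as the start of the suffix. The sketch below tightens the pattern and compares it with fileparse() on in-memory names only, so that the parsing cost can be looked at separately from the directory traversal. The sub names and the sample names are made up for illustration, '/' is assumed as the directory separator, and no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Basename qw(fileparse);

# Illustrative sample names; a real comparison would use names
# collected from an actual directory tree.
my @names = map { ("./dir$_/file$_.txt", "./dir$_/README") } 1 .. 500;

# Regex split, tightened so that dots in directory names are ignored.
sub parse_regex {
    my ($name) = @_;
    if ($name =~ m{\A(.*)\.([^./]+)\z}s) {
        return ($1, $2);
    }
    return ($name, undef);
}

# The same split via File::Basename; its suffix keeps the leading dot,
# so the dot is stripped here to make the two results comparable.
sub parse_fileparse {
    my ($name) = @_;
    my (undef, undef, $suffix) = fileparse($name, qr/\.[^.]+/);
    my $base = substr($name, 0, length($name) - length($suffix));
    return ($base, length($suffix) ? substr($suffix, 1) : undef);
}

cmpthese(-1, {
    regex     => sub { parse_regex($_)     for @names },
    fileparse => sub { parse_fileparse($_) for @names },
});
```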

Ted

anno4000 · Jul 5, 2007

Ted Zlatanov said:
> a> You said that before, but have you checked? A run of find() that
> a> calls File::Basename::fileparse() on every file takes twice the time
> a> of one that only visits the files. Doubling the overhead isn't
> a> something I'd do lightly.
>
> You're right, and I assumed incorrectly.
>
> A simple parse function:
>
>     my $suffix = undef;
>     my $name = $File::Find::name;
>     my $base = $name;
>     if ($name =~ m/(.*)\.([^.]+)/)
>     {
>         $suffix = $2;
>         $base = $1;
>     }
>
> is only 20% slower on a local filesystem, which is better but still not
> good. So I would not enable this feature by default, but only when
> requested (e.g. through File::Find::find_fullparse() or
> $File::Find::fullparse=1 or the require parameters). Would that be
> acceptable?

I believe as an option it would be useful. I think File::Find is
rather portable, so File::Basename would have to be used in order
not to deteriorate it.

Anno

Ted Zlatanov · Jul 5, 2007

> A simple parse function:
>
>     my $suffix = undef;
>     my $name = $File::Find::name;
>     my $base = $name;
>     if ($name =~ m/(.*)\.([^.]+)/)
>     {
>         $suffix = $2;
>         $base = $1;
>     }
>
> is only 20% slower on a local filesystem, which is better but still not
> good. So I would not enable this feature by default, but only when
> requested (e.g. through File::Find::find_fullparse() or
> $File::Find::fullparse=1 or the require parameters). Would that be
> acceptable?


a> I believe as an option it would be useful. I think File::Find is
a> rather portable, so File::Basename would have to be used in order
a> not to deteriorate it.

Sure. I didn't mean my parse function should be used, only that it
wasn't fast enough either, so I was completely wrong on the IO-bound
assumption.

This makes it 2 votes for, 2 against a patch that a) uses
File::Basename, b) pulls the functionality in only when requested.
Anyone else?

Here's a simple outline of the usage I would want:

    use File::Find;
    $File::Find::extparse = 1;

    find(
        sub
        {
            printf ("%s has full name %s, extension %s, and full name without extension %s\n",
                    $_, $File::Find::name, $File::Find::ext, $File::Find::name_noext);
        }, @search);

I think $_ without the extension wouldn't be very useful, and would
require another fileparse() invocation.
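
In the meantime, the outline can be approximated with a thin wrapper around find(). The sketch below is hypothetical throughout: the package My::FindExt, its find_ext function, and the variables $My::FindExt::ext and $My::FindExt::name_noext are illustrative names, not an existing API. It uses File::Basename internally, as proposed, and sets the extension to undef when there is none.

```perl
use strict;
use warnings;

package My::FindExt;   # hypothetical wrapper, not a real module
use File::Find ();
use File::Basename qw(fileparse);

our ($ext, $name_noext);

# Call File::Find::find, setting $ext and $name_noext before each
# invocation of the caller's wanted() sub.
sub find_ext {
    my ($wanted, @dirs) = @_;
    File::Find::find(sub {
        my $name = $File::Find::name;
        my (undef, undef, $suffix) = fileparse($name, qr/\.[^.]*/);
        # Full name with the suffix (including its dot) cut off.
        local $name_noext = substr($name, 0, length($name) - length($suffix));
        # Drop the leading dot; undef means "no suffix at all".
        local $ext = length($suffix) ? substr($suffix, 1) : undef;
        $wanted->();
    }, @dirs);
}

package main;

My::FindExt::find_ext(sub {
    return unless -f $_;
    printf "%s has full name %s, extension %s, and full name without extension %s\n",
        $_, $File::Find::name,
        defined $My::FindExt::ext ? $My::FindExt::ext : '(none)',
        $My::FindExt::name_noext;
}, '.');
```

Because the wrapper calls fileparse() for every visited entry, it pays the per-file cost measured earlier in the thread, which is the argument for keeping such behaviour opt-in.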

Ted

Randal L. Schwartz · Jul 5, 2007

Ted> I think it would be really useful if in addition to $File::Find::name
Ted> and $_ there were also $File::Find::namenoext, $_namenoext ($_ without
Ted> the extension), and $File::Find::ext (the extension itself). On most
Ted> OSs it should be true that

Ted> $File::Find::namenoext . '.' . $File::Find::ext eq $File::Find::name

Ted> $_namenoext . '.' . $File::Find::ext eq $_

My problem with this is that Unix doesn't have "extension" as a fundamental
concept, and Perl is basically a "happiest with Unix" tool.

Unix is perfectly fine with a file named "foo" or "foo.bar.bletch". Of those
two, what would you call the "extension"?

This isn't Windows.

print "Just another Perl hacker,"; # the original!

Ted Zlatanov · Jul 5, 2007

Ted> I think it would be really useful if in addition to $File::Find::name
Ted> and $_ there were also $File::Find::namenoext, $_namenoext ($_ without
Ted> the extension), and $File::Find::ext (the extension itself). On most
Ted> OSs it should be true that

Ted> $File::Find::namenoext . '.' . $File::Find::ext eq $File::Find::name

Ted> $_namenoext . '.' . $File::Find::ext eq $_

RLS> My problem with this is that Unix doesn't have "extension" as a fundamental
RLS> concept, and Perl is basically a "happiest with Unix" tool.

RLS> Unix is perfectly fine with a file named "foo" or "foo.bar.bletch". Of those
RLS> two, what would you call the "extension"?

1) '' according to File::Basename (undef would also work)
2) bletch

RLS> This isn't Windows.

File::Basename runs everywhere and understands extensions (it calls them
suffixes).
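
To make the two answers concrete, here is what File::Basename's fileparse() returns for both names when it is given the suffix pattern from its documentation. Note that the suffix it returns includes the leading dot:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

for my $name ('foo', 'foo.bar.bletch') {
    # The pattern matches the last dot and everything after it.
    my ($base, $dir, $suffix) = fileparse($name, qr/\.[^.]*/);
    print "$name: base='$base' suffix='$suffix'\n";
}
# foo: base='foo' suffix=''
# foo.bar.bletch: base='foo.bar' suffix='.bletch'
```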

Lots of software looks at file extensions. GNU Make and Emacs, for
example. Let's not forget Perl uses .pm for modules.

I really don't think this is a Windows-only concept.

Ted

Bo Lindbergh · Jul 6, 2007

I'd say the right way to improve File::Find is to write an object-oriented
replacement that uses no global variables at all.

/Bo Lindbergh

Michele Dondi · Jul 6, 2007

> I'd say the right way to improve File::Find is to write an object-oriented
> replacement that uses no global variables at all.

Which is what I was suggesting in another post. But do we really need
another F::F's son when there already exist File::Finder and
File::Find::Rule?

Michele

Ted Zlatanov · Jul 6, 2007

A> Ted Zlatanov wrote on VLVI September MCMXCIII:
A> __ On Thu, 05 Jul 2007 12:02:24 -0700, Randal L. Schwartz wrote:
A> __
A> __
A> __ RLS> This isn't Windows.
A> __
A> __ File::Basename runs everywhere and understands extensions (it calls them
A> __ suffixes).
A> __
A> __ Lots of software looks at file extensions. GNU Make and Emacs, for
A> __ example. Let's not forget Perl uses .pm for modules.

A> No they don't.

A> The software you mention looks for patterns in filenames - patterns
A> that may resemble how "extensions" look like on certain operating
A> systems. The operating system itself (assuming it's Unix) doesn't care.
A> And it's not that you cannot have a module in a file that doesn't
A> end in '.pm'. Such patterns may be called 'suffixes', but they aren't
A> 'extensions'.

I really don't feel like arguing over semantics and what "Unix" cares
about. My goal is to improve the user experience. Let's agree
applications often care about what follows the last dot, and therefore
this suffix or extension or whatever you want to call it is a useful
concept that users often need to manage files. I know about magic
numbers and MIME types and so on, and my proposal doesn't concern any of
that.

The core File::Basename module obviously implements that idea as a
"suffix" and I think it would be nice to the end user of File::Find to
*optionally* provide a suffix for the currently visited file through
already existing File::Basename functionality. That's all.

Ted

Ted Zlatanov · Jul 6, 2007

JS> If you actually look at File::Basename and its definition of suffix,
JS> you will see that it does _not_ use '.' for that purpose. It has no
JS> preconceived notion as to what character signifies the beginning of
JS> a suffix. You, the programmer, have to explicitly provide that information.

JS> Hardcoding '.' as the separator character is not good.
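
The point about fileparse() taking its suffix patterns from the caller can be illustrated as follows. This is a small sketch with a made-up file name; only fileparse() itself comes from File::Basename:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $name = 'backup/archive.tar.gz';

# No patterns: nothing is treated as a suffix.
my @none   = fileparse($name);
# "Everything after the last dot" has to be requested explicitly.
my @last   = fileparse($name, qr/\.[^.]*/);
# Patterns are tried in order, so a compound suffix can be listed first.
my @tar_gz = fileparse($name, qr/\.tar\.gz/, qr/\.[^.]*/);

print join('|', @$_), "\n" for \@none, \@last, \@tar_gz;
# archive.tar.gz|backup/|
# archive.tar|backup/|.gz
# archive|backup/|.tar.gz
```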

Practically speaking, I simply don't care what the separator character
is. I want the last suffix; on Windows, all Unices I know, and MacOS X
that suffix follows the last dot *by convention*. There's no
platform-neutral suffix separator, so let's use '.' and be done with it
for the sake of at least 95% of the installed Perl base.

Ted

Mumia W. · Jul 6, 2007

> I really don't feel like arguing over semantics and what "Unix" cares
> about. My goal is to improve the user experience. Let's agree
> applications often care about what follows the last dot, and therefore
> this suffix or extension or whatever you want to call it is a useful
> concept that users often need to manage files. I know about magic
> numbers and MIME types and so on, and my proposal doesn't concern any of
> that.
>
> The core File::Basename module obviously implements that idea as a
> "suffix" and I think it would be nice to the end user of File::Find to
> *optionally* provide a suffix for the currently visited file through
> already existing File::Basename functionality. That's all.
>
> Ted

I vote no. Explicit support for parsing "suffixes" is not needed in
File::Find, and the user can improve his or her experience by writing a
nominal amount of code.

If someone deems that writing that code is too much work, he or she can
create a new module and place it on CPAN.

Michele Dondi · Jul 6, 2007

> I really don't feel like arguing over semantics and what "Unix" cares
> about. My goal is to improve the user experience. Let's agree

Laudable attempt. But then it is my impression from the responses you
got here, that you wouldn't.

Michele
