Want your opinion on @ARGV file globbing


jl_post

Hi,

Recently someone asked me to write a Perl script that would operate
on a bunch of input files specified at the command line. This script
was meant for a Unix-ish system, but I developed it on an MSWin
system.

Normally, when I write a Perl script that takes an arbitrary number
of input files, I will include the line:

@ARGV = map glob, @ARGV; # for DOS globbing

I do this because the DOS shell passes in wildcard designators
unexpanded -- that is, if "*.txt" is specified as the only argument,
@ARGV will have "*.txt" as its only element. Unix shells, on the
other hand, usually expand the wildcards, so the glob()bing doesn't
have to be done.

So by using the above line I ensure that the script will have the
same wildcard-expanding behavior whether it is run in DOS or in Unix.

However, if the above line is called when run under Unix, then
technically the wildcard expansions get run twice: Once at the
command line, and once in my script. This will be a problem if any of
the input files have wildcard characters or spaces in them. For
example, if I have a file named "a b", and I call my Perl script with
"perl script.pl a*", then "a*" expands to include "a b", but then the
glob() call in the script expands that to "a" and "b", ignoring file
"a b" altogether.

So to work around that problem, I wrote my script so that it only
did file globbing if it was running on a Windows platform, like this:

if ($^O =~ m/^MSWin/i)
{
    @ARGV = map glob, @ARGV; # for DOS globbing
}

This way, input arguments won't get "double-globbed."

Happy with this, I sent my script to the person who needed it. He
responded by saying that "the argument list [was] too long." It turns
out that the wildcard expression he was using expanded out to nearly
16,000 files, which caused the Unix shell he was using to refuse to
run the resulting (long) command line.

So I made a quick change to my script: I removed the above if-check
and advised him to pass in the wildcarded arguments surrounded
by quotes. That way the shell wouldn't expand out the wildcards,
leaving Perl to do it.
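
With that change, the invocation looks something like this (the
pattern here is just an example):

    perl script.pl '*.dat'

The single quotes keep the Unix shell from expanding the pattern, so
Perl receives the literal string *.dat and glob() does the expansion
itself, free of the shell's argument-length limit.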

That "work-around" worked great. But that led me to ask: In
future scripts, should I include the check for $^O before calling
glob()? If I don't, then the files risk being "double-globbed" on Unix
systems -- but if I do, then I run the risk of the shell refusing to
call the script (without an available work-around).

Of course, this is often a moot point, as more than 99% of the
input files I've ever processed have no wildcard characters or spaces in
their filenames. But that's a guarantee I can't always make.

Perhaps I could still call glob() by default on all systems, but
include a command-line switch that forces that not to happen (in order
to prevent double-globbing). That way, the switch could be mostly
ignored, but it is there in case it's ever needed.

Or am I just overthinking this? After all, glob()bing @ARGV in all
instances (that is, regardless of platform) has never given me a
problem (yet). Maybe I should just leave it in (to be called all the
time) after all.

What are your opinions on this? Is there a convention you use that
addresses this issue? Is there an alternate way you prefer to handle
it?

Your thoughts and opinions are welcome.

Thanks,

-- Jean-Luc
 

Jim Gibson

[problems using glob on Unix and Windows systems snipped]

Your problem is one of the reasons I never use glob to find files to
process. I always use File::Find and specify the top-level directory,
either as a default or a command-line parameter. Then, I can apply any
appropriate filters to the actual file name and directory.
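
A minimal sketch of that approach (the default directory and the
filename filter here are only illustrative, not from Jim's actual
scripts):

    use strict;
    use warnings;
    use File::Find;

    # Top-level directory as a command-line parameter, with a default:
    my $topdir = shift(@ARGV) || '.';

    my @files;
    find(sub {
        return unless -f $_;          # plain files only
        return unless /\.txt\z/i;     # illustrative filter
        push @files, $File::Find::name;
    }, $topdir);

    print "$_\n" for @files;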
 

Jürgen Exner

jl_post said:
> Happy with this, I sent my script to the person who needed it. He
> responded by saying that "the argument list [was] too long." It turns
> out that the wildcard expression he was using expanded out to nearly
> 16,000 files, which caused the Unix shell he was using to refuse to
> run the resulting (long) command line. [...]
> Or am I just overthinking this? [...]

Yes, you are. There is nothing wrong with the original version of your
script and he has a problem with his shell, not with your Perl program.
The correct solution is the same as for any program on UNIX when the
shell complains about a too long arg list: use the find utility with the
execute option.
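
For the case above, that looks something like this (the pattern and
paths are illustrative):

    find /data -name '*.dat' -exec perl script.pl {} +

The trailing '+' makes find batch as many file names into each
invocation of the script as the system's argument-length limit allows;
piping find's output through xargs achieves the same effect.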

Actually your "fix" made it worse because your forced globbing in your
Perl program blocks the user from naming files with a star or a tilde in
their filenames.
If you really want to offer this changed behaviour then I would do it at
most as an additional option, controlled by a command line parameter.
Otherwise your script behaves differently from any other Unix program and
this inconsistency will bite you sooner or later.

jue
 

Keith Thompson

Jürgen Exner said:
>> Happy with this, I sent my script to the person who needed it. He
>> responded by saying that "the argument list [was] too long." It turns
>> out that the wildcard expression he was using expanded out to nearly
>> 16,000 files, which caused the Unix shell he was using to refuse to
>> run the resulting (long) command line. [...]
>> Or am I just overthinking this? [...]
>
> Yes, you are. There is nothing wrong with the original version of your
> script and he has a problem with his shell, not with your Perl program.
> The correct solution is the same as for any program on UNIX when the
> shell complains about a too long arg list: use the find utility with the
> execute option.
>
> Actually your "fix" made it worse because your forced globbing in your
> Perl program blocks the user from naming files with a star or a tilde in
> their filenames.
> If you really want to offer this changed behaviour then I would do it at
> most as an additional option, controlled by a command line parameter.
> Otherwise your script behaves differently from any other Unix program and
> this inconsistency will bite you sooner or later.

Agreed. You should *not* do your own globbing by default on Unix.
If I type

    prog 'foo bar'

and it processes the two files "foo" and "bar", that's extremely
counterintuitive behavior; it's also difficult to work around
if I really want to process a file called "foo bar".

An option to tell your program to do its own globbing wouldn't be
unreasonable, but personally I wouldn't use it; the right Unixish
solution is to use "find", "xargs", or something similar.

Or you could add an option to specify a file containing a list
of files. If you're processing 16,000 files, generating a list
of them isn't much of a burden.
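
A sketch of that idea, assuming Getopt::Long and a made-up --filelist
option:

    use strict;
    use warnings;
    use Getopt::Long;

    my $filelist;
    GetOptions('filelist=s' => \$filelist)
        or die "usage: $0 [--filelist FILE] [file ...]\n";

    if (defined $filelist) {
        open my $fh, '<', $filelist
            or die "Can't open $filelist: $!\n";
        chomp(my @names = <$fh>);    # one file name per line
        close $fh;
        push @ARGV, @names;
    }

    # ... process @ARGV as usual.  The list itself can be built
    # with something like:  ls /some/dir > list.txt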

I'm not sure what the default behavior should be on Windows.
Consistency between the Unix and Windows versions argues for not
doing your own globbing by default. Consistency between your program
and other Windows programs might argue for enabling it by default.

One thing you should look into: what does "glob" do with whitespace?
Many Windows file names contain spaces; you don't want to make it
difficult to process such files.
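
For the record: Perl's built-in glob splits its argument on whitespace
(a csh inheritance), while File::Glob's bsd_glob treats the whole
string as one pattern, so the latter is the safer choice for names
with spaces. A small illustration, with a made-up path:

    use strict;
    use warnings;
    use File::Glob qw(bsd_glob);

    # Built-in glob treats a pattern containing a space as two patterns:
    my @a = glob 'My Documents/*.txt';      # globs 'My' and 'Documents/*.txt'

    # bsd_glob treats the whole string as a single pattern:
    my @b = bsd_glob 'My Documents/*.txt';  # one pattern, space intact

    print "glob:     @a\n", "bsd_glob: @b\n";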
 

Xho Jingleheimerschmidt

jl_post said:
> So to work around that problem, I wrote my script so that it only
> did file globbing if it was running on a Windows platform, like this:
>
> if ($^O =~ m/^MSWin/i)
> {
>     @ARGV = map glob, @ARGV; # for DOS globbing
> }
>
> This way, input arguments won't get "double-globbed."
>
> Happy with this, I sent my script to the person who needed it. He
> responded by saying that "the argument list [was] too long." It turns
> out that the wildcard expression he was using expanded out to nearly
> 16,000 files, which caused the Unix shell he was using to refuse to
> run the resulting (long) command line.
>
> So I made a quick change to my script: I removed the above if-check
> and advised him to pass in the wildcarded arguments surrounded
> by quotes. That way the shell wouldn't expand out the wildcards,
> leaving Perl to do it.
>
> That "work-around" worked great. But that led me to ask: In
> future scripts, should I include the check for $^O before calling
> glob()? If I don't, then the files risk being "double-globbed" on Unix
> systems -- but if I do, then I run the risk of the shell refusing to
> call the script (without an available work-around).

If I am running a script on Linux, I'd generally expect it to work the
way almost every other Linux program works, and not double-glob.
> Of course, this is often a moot point, as more than 99% of the
> input files I've ever processed have no wildcard characters or spaces in
> their filenames. But that's a guarantee I can't always make.

I often process input files that do have spaces in them, because we have
network drives that are cross-mounted on both Linux and Windows, and
Windows users often include spaces in their file names.
> Perhaps I could still call glob() by default on all systems, but
> include a command-line switch that forces that not to happen (in order
> to prevent double-globbing). That way, the switch could be mostly
> ignored, but it is there in case it's ever needed.

I'd probably reverse that, and have it manually glob only with the
switch. But I guess it depends on what you think would be more likely,
huge file lists or file lists with spaces/wildcards.

One thing to consider is failure mode. I think an "argument list too
long" error is more likely to be noticed and correctly interpreted and
acted upon than a program which silently tries to process a
double-globbed, and hence incorrect, list of files.
> Or am I just overthinking this? After all, glob()bing @ARGV in all
> instances (that is, regardless of platform) has never given me a
> problem (yet).

Are you sure you would know if it had?

> Maybe I should just leave it in (to be called all the
> time) after all.
>
> What are your opinions on this? Is there a convention you use that
> addresses this issue? Is there an alternate way you prefer to handle
> it?


I have some scripts which are routinely called on thousands of files.
However, when used routinely, all the files are in a standard location
following a standard naming convention. So I have the program look at
@ARGV, after all switches are processed, and if it has exactly one
argument and that argument looks like, say, "ProjectXXX", then it
automatically does:

    @ARGV = glob "/aldfj/dsf/ad/sdf/$ARGV[0]/klj*/*.csv";


If you have highly specialized needs, then you do highly specialized things.

Xho
 

Xho Jingleheimerschmidt

Jürgen Exner said:
> Yes, you are. There is nothing wrong with the original version of your
> script and he has a problem with his shell, not with your Perl program.
> The correct solution is the same as for any program on UNIX when the
> shell complains about a too long arg list: use the find utility with the
> execute option.

Many programs do different things if invoked once on 16,000 files,
versus 16,000 times on one file each. Take "sort", for example.

Xho
 

Jürgen Exner

Xho Jingleheimerschmidt said:
> Many programs do different things if invoked once on 16,000 files,
> versus 16,000 times on one file each. Take "sort", for example.

Obviously. However, in that case you need a different method to pass that
list of 16,000 values anyway, because it is not possible to pass them as
command line arguments.
Writing them to a file and loading that file via a -f option comes to
mind as one simple and effective solution. I am sure there are others.

jue
 

Peter Scott

jl_post said:
> Happy with this, I sent my script to the person who needed it. He
> responded by saying that "the argument list [was] too long." It turns
> out that the wildcard expression he was using expanded out to nearly
> 16,000 files, which caused the Unix shell he was using to refuse to run
> the resulting (long) command line. [snip]
> What are your opinions on this? Is there a convention you use that
> addresses this issue? Is there an alternate way you prefer to handle
> it?

I would add an option (say, -g) to the program meaning that the arguments
should be globbed internally, with -g enabled by default on MSWin. Then
people in your user's unusual situation who don't want to or can't use
find/xargs have a solution.
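
A minimal sketch of that suggestion, assuming Getopt::Long (the -g name
is Peter's; the rest is illustrative):

    use strict;
    use warnings;
    use Getopt::Long;

    # Glob @ARGV ourselves only when asked to, defaulting to "on" under
    # Windows, where the shell does not expand wildcards for us:
    my $do_glob = ($^O eq 'MSWin32');
    GetOptions('g|glob!' => \$do_glob)
        or die "usage: $0 [-g | --no-glob] file ...\n";

    @ARGV = map glob, @ARGV if $do_glob;

    print "$_\n" for @ARGV;   # the final file list

The negatable --no-glob form also lets a Windows user suppress the
expansion if a file name really does contain a wildcard character.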
 
