Two questions about main's arguments

DSF

Hello all,

A little background on the first question. A while back I
discovered that the parsing of the command line to **argv would be
corrupted if a parameter enclosed in quotes ended in a backslash, such
as "c:\Test 1\". (Assuming "c:\Test 1\" should be argv[1], argv[1]
would also contain part or all of the next parameter, and so on,
pretty much screwing up the input to main entirely.) Recently, I had
the opportunity to investigate this, which I assumed was a bug in the
compiler's startup code. Upon browsing through the source, I found
that it is done intentionally. The startup code considers \" to be an
ESCAPE QUOTE sequence. Is this part of the standard, or just a
strange "feature" thought up by the compiler designer?

Is there any portable way to get the unparsed data passed to main?
Since I have never read anything that said this can be done, I
assume it can't. However, since C lets the programmer "get to the
guts" of things more than most other languages, I would expect there
to be a way to get at the unparsed data.

Thanks,
DSF

P.S. Question 2 was not posed to find a solution to the problem in
question 1. I solved that problem by modifying the startup source and
including it in the relevant projects.
 
Lew Pitcher

Hello all,

A little background on the first question. A while back I
discovered that the parsing of the command line to **argv would be
corrupted if a parameter enclosed in quotes ended in a backslash, such
as "c:\Test 1\". (Assuming "c:\Test 1\" should be argv[1], argv[1]
would also contain part or all of the next parameter, and so on,
pretty much screwing up the input to main entirely.) Recently, I had
the opportunity to investigate this, which I assumed was a bug in the
compiler's startup code. Upon browsing through the source, I found
that it is done intentionally. The startup code considers \" to be an
ESCAPE QUOTE sequence. Is this part of the standard, or just a
strange "feature" thought up by the compiler designer?

It's a "strange feature" introduced by the compiler designer. The standard
makes no mention of /how/ the operating environment is to parse strings
into main()'s arguments; this is best left up to the operating environment.

Is there any portable way to get the unparsed data passed to main?

No. There is no portable way to get the unparsed data.
 
Keith Thompson

DSF said:
A little background on the first question. A while back I
discovered that the parsing of the command line to **argv would be
corrupted if a parameter enclosed in quotes ended in a backslash, such
as "c:\Test 1\". (Assuming "c:\Test 1\" should be argv[1], argv[1]
would also contain part or all of the next parameter, and so on,
pretty much screwing up the input to main entirely.) Recently, I had
the opportunity to investigate this, which I assumed was a bug in the
compiler's startup code. Upon browsing through the source, I found
that it is done intentionally. The startup code considers \" to be an
ESCAPE QUOTE sequence. Is this part of the standard, or just a
strange "feature" thought up by the compiler designer?

What do you mean by "*the* startup code"? It sounds like you're
talking about some specific implementation.

There is nothing in the C standard about any special treatment of
backslash characters in the strings pointed to by the elements of argv.

In a C string literal, the sequence \" denotes a single quotation mark
character; that's just the meaning of the literal, not some run-time
translation.
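
A small sketch of that point, with function names made up purely for illustration: by the time the program runs, the literal "foo\"bar" is just seven characters with a quotation mark in the middle, and no backslash is left for anything to interpret.

```c
#include <string.h>

/* Hypothetical helper: length of the literal "foo\"bar".
 * The compiler has already turned \" into a single '"' character,
 * so the stored string is: f o o " b a r */
size_t escaped_literal_length(void)
{
    return strlen("foo\"bar");
}

/* Hypothetical helper: returns 1 if a backslash survives in the
 * stored string, 0 otherwise. */
int escaped_literal_has_backslash(void)
{
    return strchr("foo\"bar", '\\') != NULL;
}
```

Calling escaped_literal_length() should give 7, and escaped_literal_has_backslash() should give 0.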

On Unix-like systems, C programs are typically invoked from a shell
command line. Shells often treat a backslash as an escape
character; for example
/bin/echo "foo\"bar"
prints
foo"bar
on my system. But this translation happens before the program is
invoked; the program itself, including any system startup code,
never sees the backslash.

Is there any portable way to get the unparsed data passed to main?
Since I have never read anything that said this can be done, I
assume it can't. However, since C lets the programmer "get to the
guts" of things more than most other languages, I would expect there
to be a way to get at the unparsed data.

This is outside the scope of the C standard.

Just as an example, on Unix-like systems, programs are invoked via
one of the exec*() family of functions, called by the shell or by
some other program. These functions take arguments as pointers to C
strings, which are used to initialize argc and argv. If something
between the exec*() function and your main program does something
funny with backslash characters, I'd find that surprising, but the
C standard has nothing to say about it.
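
As a rough POSIX-only sketch of that (not portable C, and run_and_capture is a name invented here), the following starts a program with execvp() and hands it the argument strings exactly as given. It assumes a printf utility can be found on PATH. Since no shell is involved, a trailing backslash in an argument should arrive untouched.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] (looked up on PATH) with exactly the
 * strings in argv, and copy what it writes to stdout into `out`
 * (NUL-terminated, truncated to fit). Returns the number of bytes
 * stored, or -1 on failure. */
long run_and_capture(char *const argv[], char *out, size_t out_size)
{
    int fds[2];
    if (out_size == 0 || pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                      /* child */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);     /* stdout now feeds the pipe */
        close(fds[1]);
        execvp(argv[0], argv);           /* argv is passed through as-is */
        _exit(127);                      /* only reached if exec failed */
    }

    close(fds[1]);                       /* parent: read until EOF */
    size_t used = 0;
    char chunk[256];
    ssize_t got;
    while ((got = read(fds[0], chunk, sizeof chunk)) > 0) {
        for (ssize_t i = 0; i < got; i++)
            if (used + 1 < out_size)     /* keep room for the terminator */
                out[used++] = chunk[i];
    }
    out[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (long)used;
}
```

For example, passing the argument vector { "printf", "%s", "c:\\Test 1\\", NULL } should leave the ten characters c:\Test 1\ in the output buffer.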

To see what the standard *does* say about how argc and argv are
initialized, grab a copy of
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf>
and read section 5.1.2.2.1 (note that it applies only to hosted
implementations).

[...]
 
Lew Pitcher

Hello all,

A little background on the first question. A while back I
discovered that the parsing of the command line to **argv would be
corrupted if a parameter enclosed in quotes ended in a backslash, such
as "c:\Test 1\". (Assuming "c:\Test 1\" should be argv[1], argv[1]
would also contain part or all of the next parameter, and so on,
pretty much screwing up the input to main entirely.) Recently, I had
the opportunity to investigate this, which I assumed was a bug in the
compiler's startup code. Upon browsing through the source, I found
that it is done intentionally. The startup code considers \" to be an
ESCAPE QUOTE sequence. Is this part of the standard, or just a
strange "feature" thought up by the compiler designer?

It's a "strange feature" introduced by the compiler designer. The standard
makes no mention of /how/ the operating environment is to parse strings
into main()'s arguments; this is best left up to the operating
environment.

For what it's worth, here's what the C99 standard had to say about main()'s
parameters:
5.1.2.2.1 Program startup
The function called at program startup is named main. The implementation
declares no prototype for this function. It shall be defined with a return
type of int and with no parameters:
int main(void) { /* ... */ }
or with two parameters (referred to here as argc and argv, though any
names may be used, as they are local to the function in which they are
declared):
int main(int argc, char *argv[]) { /* ... */ }
or equivalent; or in some other implementation-defined manner.
If they are declared, the parameters to the main function shall obey the
following constraints:
— The value of argc shall be nonnegative.
— argv[argc] shall be a null pointer.
— If the value of argc is greater than zero, the array members argv[0]
through argv[argc-1] inclusive shall contain pointers to strings, which
are given implementation-defined values by the host environment prior to
program startup. The intent is to supply to the program information
determined prior to program startup from elsewhere in the hosted
environment. If the host environment is not capable of supplying strings
with letters in both uppercase and lowercase, the implementation shall
ensure that the strings are received in lowercase.
— If the value of argc is greater than zero, the string pointed to by
argv[0] represents the program name; argv[0][0] shall be the null
character if the program name is not available from the host
environment. If the value of argc is greater than one, the strings
pointed to by argv[1] through argv[argc-1] represent the program
parameters.
— The parameters argc and argv and the strings pointed to by the argv
array shall be modifiable by the program, and retain their last-stored
values between program startup and program termination.

The key phrases are: "given implementation-defined values by the host
environment prior to program startup" and "determined prior to program
startup from elsewhere in the hosted environment". In other words, the
standard doesn't say /how/ the strings are to be parsed from the original
source material, or even /whether/ the strings are to be parsed from the
original source material. All it says is that there will be strings.
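
To make those guarantees concrete, here is a minimal sketch of a checker for the constraints quoted above. The function name argv_meets_constraints is made up for this example; it tests only what the standard promises about argc and argv, and says nothing about how the strings were produced.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if (argc, argv) satisfies the quoted
 * constraints, 0 otherwise:
 *   - argc is nonnegative
 *   - argv[argc] is a null pointer
 *   - argv[0] .. argv[argc-1] are non-null pointers to strings */
int argv_meets_constraints(int argc, char *argv[])
{
    if (argc < 0)
        return 0;
    if (argv[argc] != NULL)
        return 0;
    for (int i = 0; i < argc; i++) {
        if (argv[i] == NULL)
            return 0;
    }
    return 1;
}
```

Called as argv_meets_constraints(argc, argv) at the top of main(), it should return 1 on any conforming hosted implementation, whatever parsing happened beforehand.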

[snip]
 
Seebs

A little background on the first question. A while back I
discovered that the parsing of the command line to **argv would be
corrupted if a parameter enclosed in quotes ended in a backslash, such
as "c:\Test 1\". (Assuming "c:\Test 1\" should be argv[1], argv[1]
would also contain part or all of the next parameter, and so on,
pretty much screwing up the input to main entirely.) Recently, I had
the opportunity to investigate this, which I assumed was a bug in the
compiler's startup code. Upon browsing through the source, I found
that it is done intentionally. The startup code considers \" to be an
ESCAPE QUOTE sequence. Is this part of the standard, or just a
strange "feature" thought up by the compiler designer?

This question depends a whole lot on the compiler you're using. But,
I will say: I have never in my life used a system on which any of the
above was true. Every compiler I've ever used has done NOTHING AT ALL
to parse or interpret or even look at the arguments passed to it; that
was done by the program which invoked my program. On a Unix system,
typically that would be the shell.

So you're off in the implementation-defined territory of "how do you
invoke programs and what determines what their arguments are".

-s
 
Lew Pitcher

A little background on the first question. A while back I
discovered that the parsing of the command line to **argv would be
corrupted if a parameter enclosed in quotes ended in a backslash, such
as "c:\Test 1\". (Assuming "c:\Test 1\" should be argv[1], argv[1]
would also contain part or all of the next parameter, and so on,
pretty much screwing up the input to main entirely.) Recently, I had
the opportunity to investigate this, which I assumed was a bug in the
compiler's startup code. Upon browsing through the source, I found
that it is done intentionally. The startup code considers \" to be an
ESCAPE QUOTE sequence. Is this part of the standard, or just a
strange "feature" thought up by the compiler designer?

This question depends a whole lot on the compiler you're using. But,
I will say: I have never in my life used a system on which any of the
above was true. Every compiler I've ever used has done NOTHING AT ALL
to parse or interpret or even look at the arguments passed to it; that
was done by the program which invoked my program. On a Unix system,
typically that would be the shell.

IIRC, in the MSDOS/Windows environment, some of that parsing is done by the
application runtime code (think filename globbing) rather than the operating
environment. It wouldn't surprise me to find that Microsoft delegated the
whole task of argument parsing to the application.
 
Nobody

IIRC, in the MSDOS/Windows environment, some of that parsing is done by the
application runtime code (think filename globbing) rather than the operating
environment. It wouldn't surprise me to find that Microsoft delegated the
whole task of argument parsing to the application.

It has. A process is created with a command line consisting of a single
string. If the program uses the WinMain() entry point, this may never be
parsed into words.

If the program uses main(), it's up to the program to parse the string
into individual arguments (and historically, different compilers had
slightly different parsing rules regarding e.g. quotes and backslashes).
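
To illustrate how such rules can differ, here is a deliberately simplified sketch of a startup-style splitter. The name split_command_line and its escape_quote flag are invented for this example, and real runtimes follow more elaborate rules than this. It only shows why treating \" as an escaped quote makes an argument like "c:\Test 1\" swallow the parameter that follows it.

```c
#include <stddef.h>

/* Hypothetical, simplified command-line splitter (not any real
 * runtime's algorithm). Splits `s` on spaces and tabs, honouring
 * double quotes. If `escape_quote` is nonzero, the two characters \"
 * are taken as a literal quote; otherwise a backslash is an ordinary
 * character. Arguments are stored back to back in `buf` and their
 * addresses in `args`. Overlong input is silently truncated.
 * Returns the number of arguments found. */
size_t split_command_line(const char *s, int escape_quote,
                          char *buf, size_t buf_size,
                          char *args[], size_t max_args)
{
    size_t n = 0;       /* arguments found so far */
    size_t used = 0;    /* bytes of buf consumed */

    while (*s != '\0') {
        while (*s == ' ' || *s == '\t')      /* skip separators */
            s++;
        if (*s == '\0' || n == max_args || used >= buf_size)
            break;

        int in_quotes = 0;
        args[n] = buf + used;
        while (*s != '\0' && (in_quotes || (*s != ' ' && *s != '\t'))) {
            char c = *s++;
            if (escape_quote && c == '\\' && *s == '"') {
                c = *s++;                    /* \" becomes a literal " */
            } else if (c == '"') {
                in_quotes = !in_quotes;      /* quote toggles, not copied */
                continue;
            }
            if (used + 1 < buf_size)         /* keep room for terminator */
                buf[used++] = c;
        }
        buf[used++] = '\0';
        n++;
    }
    return n;
}
```

Fed the command line prog "c:\Test 1\" second, the sketch should yield three arguments with escape_quote set to 0 (prog, c:\Test 1\ and second), but only two with it set to 1, the second being c:\Test 1" second.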

Moreover, in NT-based dialects, the command line is a Unicode string. If
the program uses WinMain() or main(), this will be converted to
the current codepage; if it uses wWinMain() or wmain(), the data stays in
Unicode.

Executing a program which uses main() via e.g. system() will cause the
command line to be converted to Unicode then back to the current codepage.
This transformation isn't always lossless.

As others have pointed out, on Unix there may not even be a command
string. A process may call one of the exec() functions directly, in which
case no parsing is involved.
 
Denis McMahon

This question depends a whole lot on the compiler you're using.

I think it's got nothing to do with the compiler, and everything to do
with how the cli or gui concerned prepares the parameters it is invoking
main with.

All the compiler can do is create start-up code to process the received
strings into an array and pass a pointer to the array and an int count
of the array size to main.

The start up code receives the parsed strings from the calling cli / gui.

If the cli/gui parses:

progrm "dum de dum \" derfr "munge"

as the strings:

progrm
dum de dum \
derfr
munge

then those are the strings that will be passed, however, if it parses as:

progrm
dum de dum \" derfr
munge

then those are the strings that will be passed from the cli / gui to the
startup code.

The startup code isn't there to second guess the string parsing that has
already been done by the cli/gui!

Rgds

Denis McMahon
 
Seebs

I think it's got nothing to do with the compiler, and everything to do
with how the cli or gui concerned prepares the parameters it is invoking
main with.

Not so; it's implementation-defined how that happens, and apparently
some systems just pass a string to each executable, and leave it up to
the runtime to do the parsing.

Not a good choice, IMHO, but hey, it's up to them.

-s
 
robertwessel2

Not so; it's implementation-defined how that happens, and apparently
some systems just pass a string to each executable, and leave it up to
the runtime to do the parsing.

Not a good choice, IMHO, but hey, it's up to them.


Well, both approaches are clearly "good enough," but neither, IMHO, is
clearly superior, and both have serious flaws. The *nix approach of
globbing filenames, for example, certainly handles some common and
simple cases well, but can make more complex commands difficult or
awkward to implement. Consider the reasonably common command line
where you specify several input files and options for each:

cmd file1 -a file2 -b -c file3 -a -d

(and feel free to move the options in front of the filenames if you
prefer). If you specify file2 as "*file2*", the *nix approach makes a
hash of things for you, unless the user remembers to (awkwardly) quote
the filename, and then the program has to handle the wildcards anyway
(or you do something semi-unnatural like put parentheses around the
"file2* -b -c" group, and have the program recognize those).

Or a case where you really do want the program to handle the wildcard,
say a directory search, like a Windows "dir a* /s", which will find
all files starting with a in the current directory and any
subdirectories. Requiring the filespec to be quoted, à la find, is not
particularly natural, IMO. IOW, why does "ls a*" do the expected*
thing, but "ls -R a*" does not, and why does anyone think
'find . -name "a*"' is a reasonable syntax to expect?

Of course *we* know what's happening, and why, but it's certainly not
an obvious behavior for a non-expert.

OTOH, the MS-DOS/Windows approach leaves even the common/simple cases
up to the applications, and you have a hash of different parsing rules
at the whims of the developers (of either the applications or the CRT
startup code). And that's a whole different mess.

And then throw in filenames that contain whitespace, and basically
none of these approaches work particularly well…


*Expected in the sense that basically every tutorial I've ever seen
describes a case similar to "ls a*" as "displaying all the files in
the current directory starting with 'a'."
 
Nick Keighley

Well, both approaches are clearly "good enough," but neither, IMHO, is
clearly superior, and both have serious flaws.  The *nix approach of
globbing filenames, for example, certainly handles some common and
simple cases well, but can make more complex commands difficult or
awkward to implement.

I've had problems when globbing generated truly enormous command lines


<snip>
 
David Resnick

I've had problems when globbing generated truly enormous command lines

<snip>

xargs is often a solution to that problem in *nix... Mind you, if you
need a single program invocation and exceed whatever shell or other
limits there may be on command line length you are stuck putting the
data in a file or whatever other workaround you come up with. But for
on the fly fixing of the huge command line problem xargs is quite
nice. Also good for passing 1 arg at a time to a program, etc.
 
Kenny McCormack

Well, both approaches are clearly "good enough," but neither, IMHO, is
clearly superior, and both have serious flaws.  The *nix approach of
globbing filenames, for example, certainly handles some common and
simple cases well, but can make more complex commands difficult or
awkward to implement.

I've had problems when globbing generated truly enormous command lines

I think it is clear, at this late date, that the so-called "MSDOS" way
of doing it is best, *provided* that the implementation provides a
standard library call to do the parsing that, in the Unix way, the shell
normally does for you. And makes it damn clear that every app should
invoke this function unless it has a damn good reason not to (something
like 'find' being a good example of a valid exception).

Implicit in the above is the fact that neither Unix nor DOS/Windows
ever had the concept of "application developer discipline" (a term I
just made up). Developers were always encouraged to "go their own way",
to "live free" and to "follow their dreams". Contrast this with the Mac
way, where there were always clear guidelines (from Apple) telling you
how to do each little thing (and that you deviated at your peril).

I see the "Unix way" (of handling the command line) as an attempt to
"standardize without standardizing" - that is, to nip the problem in the
bud, without requiring explicit/heavy-handed edicts that might be seen as
crimping fragile developer egos.
 
Tom St Denis

I think it is clear, at this late date, that the so-called "MSDOS" way
of doing it is best, *provided* that the implementation provides a
standard library call to do the parsing that, in the Unix way, the shell
normally does for you.  And makes it damn clear that every app should
invoke this function unless it has a damn good reason not to (something
like 'find' being a good example of a valid exception).

Quote yer damn arguments


tstdenis@photon:~$ cd wtf
tstdenis@photon:~/wtf$ touch kenny
tstdenis@photon:~/wtf$ touch is
tstdenis@photon:~/wtf$ touch an
tstdenis@photon:~/wtf$ touch idiot
tstdenis@photon:~/wtf$ echo *
an idiot is kenny
tstdenis@photon:~/wtf$ echo "*"
*

Wow, that's hard.

Tom
 
Kenny McCormack

Quote yer damn arguments


tstdenis@photon:~$ cd wtf
tstdenis@photon:~/wtf$ touch kenny
tstdenis@photon:~/wtf$ touch is
tstdenis@photon:~/wtf$ touch an
tstdenis@photon:~/wtf$ touch idiot
tstdenis@photon:~/wtf$ echo *
an idiot is kenny
tstdenis@photon:~/wtf$ echo "*"
*

Wow, that's hard.

Tom

Somebody needs to get back on their meds. (And it ain't me!)

I'm not saying there is anything inherently wrong with the Unix-y way,
but it is sub-optimal. And I think most people can be honest about that.

--
No, I haven't, that's why I'm asking questions. If you won't help me,
why don't you just go find your lost manhood elsewhere.

CLC in a nutshell.
 
FredK

Kenny McCormack said:
I think it is clear, at this late date, that the so-called "MSDOS" way
of doing it is best, *provided* that the implementation provides a
standard library call to do the parsing that, in the Unix way, the shell
normally does for you. And makes it damn clear that every app should
invoke this function unless it has a damn good reason not to (something
like 'find' being a good example of a valid exception).

Try VMS. It provides a CLI (command line interpreter) interface that
requires both a formal description of the user interface, and provides CLI
routines to the application that allow both access to the unmodified command
string as well as the individual already validated components (parameters,
switches, etc). The OS command line interpreter component validates and
processes all of the command line before the application is called at its
entry point. The CLI routines can also be used to process input on-the-fly
(additional user input after the application is executing).

It is a pain in the butt to use, the application code to support it takes a
little bit of work to understand, it does have its limitations and
constraints, and isn't always pretty.

However, it gives you a consistent command line UI that isn't based on
whitespace and imagination.

It also represents the thing most application writers appear to hate - the
bookkeeping details of writing an application itself, as opposed to the
application code. It's much simpler to loop through argv looking for "-f",
usually pasting code that you've reused 50 times before.

Command line interfaces, error checking, and decent commenting are the #1
shortcomings for most code written the "UNIX way" (including Linux under
just-another-UNIX).
 
DSF

Hello all,

Thank you all for answering.

I must admit that all of my C programming experience has been in the
MSDOS/Windows world. In that world, the entire command line is passed
as one large string, which must be parsed by the C startup code (that
which is executed before main) to produce the argc/argv style input
that main expects. I also must admit that it never occurred to me
that this may not be the normal or even most common way of doing
things. (I have a red forehead where I've slapped myself, Bull
Shannon-style :) Hence my question about getting the data unparsed: I
assumed it was normal for the compiler's startup code to receive one
huge string and split it up. It didn't occur to me that the OS might
provide the data already parsed. (And judging by the number of
responses along the lines of "Your operating system does that."
pre-parsed data must be much closer to "normal.")


So both of my questions have been answered.

The ESCAPE QUOTE on the command line is the design of the compiler
writer and not there to follow a C specification of what argv
receives. That means that rewriting the startup code to eliminate
that "feature" as I mentioned in my post script is not breaking any C
specification.

The reason I've never read anything on how to get an unparsed
command line is because it's an MSDOS/Windows thing. Since portability
is not an issue, this leaves me free to explore something I read in
the comments of the startup code along the lines of preserving an
intact command line for programmers that want it.

DSF
 
Tom St Denis

Somebody needs to get back on their meds.  (And it ain't me!)

I'm not saying there is anything inherently wrong with the Unix-y way,
but it is sub-optimal.  And I think most people can be honest about that.

sub-optimal how? If I want to tar all .c files in a dir I can specify
*.c as opposed to manually globbing it.

If I want to literally specify *.c I can write \*.c or "*.c" ...

But usually I don't want to pass a program the literal *.c.

Can you elaborate under which typical circumstance the automatic
globbing is detrimental?

Tom
 
Kenny McCormack

I'm not saying there is anything inherently wrong with the Unix-y way,
but it is sub-optimal.  And I think most people can be honest about that.

sub-optimal how? If I want to tar all .c files in a dir I can specify
*.c as opposed to manually globbing it.

The usual principle is "If method A supports all options and method B
only supports a subset of that, then method A is better - even if method
A requires more work to do the most common cases."

I.e., if you had access to the entire command line (in Unix), you could do
anything, but if you don't (have that access), your options are limited.

Note, BTW, that in modern Windows programming (DOS has another way of
doing the same thing, but, as Jacob often points out, DOS is way old at
this point and we shouldn't have to continue to talk about it), you
actually do have your cake and eat it too. You have the normal, Unixy
argv/argc by the usual method, but if you need to, you can get the
unparsed command line in the COMMANDLINE environment variable (Note: I
haven't tested this last - so I could be wrong on a detail or two).

If I want to literally specify *.c I can write \*.c or "*.c" ...

But usually I don't want to pass a program the literal *.c.

Can you elaborate under which typical circumstance the automatic
globbing is detrimental?

Tom

Obvious observation (which I'm sure you are well aware of): It puts too
much onus on the user. Again, think of 'find'.

'mmv' (not standard SUS, of course) is another example of a command
where you always have to quote your arguments; it's a pain.
 
robertwessel2

Note, BTW, that in modern Windows programming (DOS has another way of
doing the same thing, but, as Jacob often points out, DOS is way old at
this point and we shouldn't have to continue to talk about it), you
actually do have your cake and eat it too.  You have the normal, Unixy
argv/argc by the usual method, but if you need to, you can get the
unparsed command line in the COMMANDLINE environment variable (Note: I
haven't tested this last - so I could be wrong on a detail or two).


Not sure about the environment variable, but there's a Windows API
function, GetCommandLine(), which does the obvious thing (and supports
both ANSI and Unicode strings). If you'd like a C main()-style parse
of that (or any other string), you can use CommandLineToArgvW(). Of
course, this being Windows, it doesn't glob.
 
