regex in filenames

GSA

Hi,
How do I get the list of filenames in a folder using a regex in C?
Scenario:
I have a set of files, say
fileData1.txt
fileData2.txt
fileData3.txt
fileData4.txt
..
..
fileData31.txt
I need to take each of these files in a loop and process each one. How
can I use a regex, say "fileData*.txt", to retrieve each of them?
PS: I also have files with other names in the folder.

Regards,
GSA
 
Malcolm McLean

Here's the code for a wildcard matcher. Wild cards (globbers) are
easier to use and to implement than regular expressions.

You'll need to open the directory and step through it.

static int chmatch(const char *target, const char *pat);

/*
  wildcard matcher.
  Params: str - the target string
          pattern - pattern to match
  Returns: 1 if match, 0 if not.
  Notes: ? - match any character
         * - match zero or more characters
         [?], [*], escapes.
*/
int matchwild(const char *str, const char *pattern)
{
  const char *target = str;
  const char *pat = pattern;
  int gobble;

  while( (gobble = chmatch(target, pat)) )
  {
    target++;
    pat += gobble;
  }
  if(*target == 0 && *pat == 0)
    return 1;
  else if(*pat == '*')
  {
    if(pat[1] == 0)
      return 1;
    while(*target)
      if(matchwild(target++, pat+1))
        return 1;
  }
  return 0;
}

/*
  match a character.
  Params: target - target string
          pat - pattern string.
  Returns: number of pat characters matched.
  Notes: means that a * in pat will return zero
*/
static int chmatch(const char *target, const char *pat)
{
  if(*target == 0 || *pat == 0)
    return 0;

  if(*pat == '*')
    return 0;

  if(*target == '?')
  {
    if(pat[0] == '[' && pat[1] == '?' && pat[2] == ']')
      return 3;
    else
      return 0;
  }
  if(*target == '*')
  {
    if(pat[0] == '[' && pat[1] == '*' && pat[2] == ']')
      return 3;
    else
      return 0;
  }
  if(*target == *pat)
    return 1;
  if(*pat == '?')
    return 1;
  return 0;
}
 
Nobody

How do I get the list of filenames in a folder using reg-ex in C?

Regexps aren't part of the C standard; the functions involved are
platform-specific. Actually, directories (folders) aren't part of the C
standard either, so anything involving directory listing is
platform-specific.
I need to take each of these files in a loop and process each one. How
can I use regex, say "fileData*.txt" to retrieve each of them?

I suspect you don't mean that as a regexp, but as a "glob" pattern.

As a regexp, it would match "fileDat" followed by zero or more "a"s
followed by any single character followed by "txt". As a glob pattern, it
would match "fileData" followed by zero or more characters followed by
".txt".

On Windows, the standard directory-listing function FindFirstFile()
automatically performs glob matching (* and ? aren't allowed in filenames
on Windows, so there's no conflict):

http://msdn.microsoft.com/en-us/library/aa364418(VS.85).aspx

On POSIX, the glob() function can be used to find all matching filenames:

http://www.opengroup.org/onlinepubs/9699919799/functions/glob.html
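
As a rough sketch of the glob() route (not from the original post; the helper names for_each_match and print_path are made up for illustration), something like the following would visit each matching file. By default glob() sorts its results as strings, so fileData10.txt would normally come before fileData2.txt.

```c
#include <glob.h>
#include <stdio.h>

/* Hypothetical helper: calls process() on every pathname matching a glob
   pattern. Returns the number of matches, or -1 on error. */
static int for_each_match(const char *pattern, void (*process)(const char *path))
{
    glob_t results;
    size_t i;
    int count;
    int rc = glob(pattern, 0, NULL, &results);

    if (rc != 0) {
        globfree(&results);
        /* Finding nothing is not treated as an error here. */
        return rc == GLOB_NOMATCH ? 0 : -1;
    }
    for (i = 0; i < results.gl_pathc; i++)
        process(results.gl_pathv[i]);
    count = (int)results.gl_pathc;
    globfree(&results);
    return count;
}

/* Hypothetical callback: stands in for the real per-file processing. */
static void print_path(const char *path)
{
    printf("%s\n", path);
}
```

Calling for_each_match("fileData*.txt", print_path) from the directory that holds the files should print each matching name and skip the files with other names.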

The fnmatch() function tests whether a given string matches a glob pattern:

http://www.opengroup.org/onlinepubs/9699919799/functions/fnmatch.html
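
If you would rather step through the directory yourself, fnmatch() can do the filtering. The following is only a sketch under POSIX assumptions; list_glob_matches is a name invented for this example.

```c
#include <dirent.h>
#include <fnmatch.h>
#include <stdio.h>

/* Hypothetical helper: prints every entry in dirpath whose name matches the
   glob pattern. Returns the number of matches, or -1 if the directory
   cannot be opened. */
static int list_glob_matches(const char *dirpath, const char *pattern)
{
    DIR *dir = opendir(dirpath);
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL) {
        /* fnmatch() returns 0 when the name matches the pattern. */
        if (fnmatch(pattern, entry->d_name, 0) == 0) {
            printf("%s\n", entry->d_name);
            count++;
        }
    }
    closedir(dir);
    return count;
}
```

Unlike glob(), readdir() returns entries in no particular order, so sort the names yourself if the processing order matters.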

If you actually want to match against regexps rather than glob patterns,
use opendir/readdir/closedir to list the directory, and use
regcomp/regexec to match the names against the regexp. No idea how you'd
do that on Windows (does it even have regexp functions?).
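
For the regexp route on POSIX, a minimal sketch might look like this. The function name list_regex_matches is invented for the example, and the regexp in the usage note is only one guess at what the original poster wants.

```c
#include <dirent.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: prints every entry in dirpath whose name matches the
   POSIX extended regexp regex_text. Returns the number of matches, or -1 if
   the regexp does not compile or the directory cannot be opened. */
static int list_regex_matches(const char *dirpath, const char *regex_text)
{
    regex_t re;
    DIR *dir;
    struct dirent *entry;
    int count = 0;

    if (regcomp(&re, regex_text, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    dir = opendir(dirpath);
    if (dir == NULL) {
        regfree(&re);
        return -1;
    }
    while ((entry = readdir(dir)) != NULL) {
        /* regexec() returns 0 when the name matches. */
        if (regexec(&re, entry->d_name, 0, NULL, 0) == 0) {
            printf("%s\n", entry->d_name);
            count++;
        }
    }
    closedir(dir);
    regfree(&re);
    return count;
}
```

A call such as list_regex_matches(".", "^fileData[0-9]+\\.txt$") would pick out fileData1.txt through fileData31.txt. The anchors matter: without ^ and $ the regexp would also match a name like xfileData1.txt.bak.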
 
Ben Bacarisse

There are a couple of bugs here. ** as a trailing pattern does not match
the empty string and the pattern ? does not match ? or *.
 
Malcolm McLean

Try this fix:

/*
  wildcard matcher.
  Params: str - the target string
          pattern - pattern to match
  Returns: 1 if match, 0 if not.
  Notes: ? - match any character
         * - match zero or more characters
         [?], [*], escapes.
*/
int matchwild(const char *str, const char *pattern)
{
  const char *target = str;
  const char *pat = pattern;
  int gobble;

  while( (gobble = chmatch(target, pat)) )
  {
    target++;
    pat += gobble;
  }
  if(*target == 0 && *pat == 0)
    return 1;
  else if(*pat == '*')
  {
    while(pat[1] == '*')
      pat++;
    if(pat[1] == 0)
      return 1;
    while(*target)
      if(matchwild(target++, pat+1))
        return 1;
  }
  return 0;
}

/*
  match a character.
  Params: target - target string
          pat - pattern string.
  Returns: number of pat characters matched.
  Notes: means that a * in pat will return zero
*/
static int chmatch(const char *target, const char *pat)
{
  if(*target == 0 || *pat == 0)
    return 0;

  if(*target == *pat)
    return 1;
  if(*pat == '?')
    return 1;

  if(*pat == '*')
    return 0;

  if(*target == '?')
  {
    if(pat[0] == '[' && pat[1] == '?' && pat[2] == ']')
      return 3;
    else
      return 0;
  }
  if(*target == '*')
  {
    if(pat[0] == '[' && pat[1] == '*' && pat[2] == ']')
      return 3;
    else
      return 0;
  }

  return 0;
}
 
Ben Bacarisse

Malcolm McLean said:
Try this fix

  target   pattern  comments
   *x         *     does not match when it should
   [x]       [?]    matches when it shouldn't
   [x]       [*]    matches when it shouldn't

<snip>
 
Malcolm McLean

I checked on Wikipedia and apparently there's no definite syntax for
globs. '\' is used as an escape character in some syntaxes which is
obviously completely unacceptable for DOS. So I'm not sure how to fix
the last two.

The first error is more serious however. Doing the test for '*' before
the test for a match should fix it.

static int chmatch(const char *target, const char *pat)
{
  if(*target == 0 || *pat == 0)
    return 0;

  if(*pat == '*')
    return 0;

  if(*target == *pat)
    return 1;
  if(*pat == '?')
    return 1;

  if(*target == '?')
  {
    if(pat[0] == '[' && pat[1] == '?' && pat[2] == ']')
      return 3;
    else
      return 0;
  }
  if(*target == '*')
  {
    if(pat[0] == '[' && pat[1] == '*' && pat[2] == ']')
      return 3;
    else
      return 0;
  }

  return 0;
}
 
Ben Bacarisse

Malcolm McLean said:
I checked on Wikipedia and apparently there's no definite syntax for
globs. '\' is used as an escape character in some syntaxes which is
obviously completely unacceptable for DOS. So I'm not sure how to fix
the last two.

I don't see what difference the syntax makes. The error is just a logic
error in the program: you must test for *target == *pat last and the
nesting of the tests that relate to the escaped characters is wrong.
You need:

    if (pat[0] == '[' && pat[1] == '?' && pat[2] == ']')
        return *target == '?' ? 3 : 0;

i.e. the test on *target must be conditional on the pattern being there,
not the other way round.

Malcolm McLean said:
The first error is more serious however. Doing the test for '*' before
the test for a match should fix it.

Your definition of serious does not match mine. In

del *

a file named *x will be left over, but with the second error, an attempt
to remove a file called ? using

del [?]

might also remove a file called [x] unexpectedly.

<snip>
 
Malcolm McLean

That could only happen if the user has a file named [x] (where x is
any character) and a file named "?" in the same directory, which is
extremely unlikely.

By adding complex escape syntax you actually make errors more likely,
because the user might get the expression wrong.
 
Ben Bacarisse

Malcolm McLean said:
That could only happen if the user has a file named [x] (where x is
any character) and a file named "?" in the same directory, which is
extremely unlikely.

No, there does not have to be a file named "?". That may be the reason
someone would type that pattern, but the pattern might be in a script or
used in some other code (this is a C function after all) for all sorts
of reasons.

Malcolm McLean said:
By adding complex escape syntax you actually make errors more likely,
because the user might get the expression wrong.

But you added the escape syntax. All that happened is that you didn't
implement the expected semantics. I am not suggesting adding anything
more -- simply correcting the code. I told you what was needed. It's
a tiny alteration, yet you'd rather allow an unintended match because it's
unlikely. That just seems truly bizarre.
 
Malcolm McLean

The problem is that you can't match the pattern "[ character ]" if you
treat "[?]" as an escape for "?". You need to have some way of
escaping the escape. The approved syntax is "\[?\]", but it's not a
universal standard, and the backslash is a path separator in DOS. If
you add too much then you end up with regular expressions, which are
too difficult for the ordinary user to enter.

So currently "[?]" will match either "[ character ]" or "?". This
isn't ideal, but in practice it's unlikely to cause much difficulty.
 
Ben Bacarisse

Malcolm McLean said:
The problem is that you can't match the pattern "[ character ]" if you
treat "[?]" as an escape for "?". You need to have some way of
escaping the escape. The approved syntax is "\[?\]", but it's not a
universal standard, and the backslash is a path separator in DOS.

If the behaviour was deliberate, the comment should document it and we
could have avoided this part of the discussion. If the user is warned
that [?] matches more than a single, literal, '?' then I agree that you
can probably get away with it.

However, if you just want [?] and [*] to be escaped versions of ? and *
(as the comment suggests) then I think you should implement that. This
will mean that '[?]' only matches ? but that is less surprising (to me)
than allowing what looks like an escaped character to match something
else. I was suggesting you fix this simply because I think it's safer
than having an unexpected match.

To solve the problem fully you don't need any new escape syntax, though
you do have to generalise the use of [...]. If you write either:

    if (pat[0] == '[' && pat[1] && strchr("[]?*", pat[1]) && pat[2] == ']')
        return pat[1] == *target ? 3 : 0;

or (to make [...] always be an escape sequence even when the enclosed
character has no special meaning):

    if (pat[0] == '[' && pat[1] && pat[2] == ']')
        return pat[1] == *target ? 3 : 0;

then [?] matches only '?'; [[]x] matches only '[x]'; [[]?] matches any
single character in brackets and [[][?]] matches exactly '[?]'. I think
one consistent syntax is better than surprise matches.
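
Here is one way the second variant could be slotted into the matcher, as a sketch rather than a drop-in replacement. The _sketch names are invented here so they are not mistaken for the code posted earlier in the thread, and the '*' handling is copied from the version that skips repeated stars.

```c
/* Match one target character against the start of pat.
   Returns the number of pattern characters consumed, or 0 for no match.
   A '*' in pat also returns 0, so the caller has to handle it. */
static int chmatch_sketch(const char *target, const char *pat)
{
    if (*target == 0 || *pat == 0)
        return 0;
    if (*pat == '*')
        return 0;
    /* [c] is always an escape: it matches only the literal character c.
       The test on *target is conditional on the pattern being there. */
    if (pat[0] == '[' && pat[1] && pat[2] == ']')
        return pat[1] == *target ? 3 : 0;
    if (*pat == '?')
        return 1;
    /* Plain character comparison comes last. */
    return *target == *pat ? 1 : 0;
}

/* Returns 1 if the whole of target matches pat, 0 otherwise. */
static int matchwild_sketch(const char *target, const char *pat)
{
    int gobble;

    while ((gobble = chmatch_sketch(target, pat)) != 0) {
        target++;
        pat += gobble;
    }
    if (*target == 0 && *pat == 0)
        return 1;
    if (*pat == '*') {
        while (pat[1] == '*')   /* collapse runs of stars */
            pat++;
        if (pat[1] == 0)
            return 1;
        while (*target)
            if (matchwild_sketch(target++, pat + 1))
                return 1;
    }
    return 0;
}
```

With this ordering, "[?]" matches "?" but not "[x]", "[[]x]" matches "[x]", and "*" matches "*x".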

This fuller meaning for the square brackets shows why a single escape
character is usually used. `?, ``? and ```? are simpler than [?], [[]?,
and [[][?] (I'm just picking ` out of the air -- I agree that you want
to avoid using \ in MSDOS).

Malcolm McLean said:
If you add too much then you end up with regular expressions, which are
too difficult for the ordinary user to enter.

I am not suggesting adding so much that you get REs. I would suggest
that [...] could be simply extended to permit [ and/or ] to be quoted.
You don't need to allow both, but it seems natural to do so.

Malcolm McLean said:
So currently "[?]" will match either "[ character ]" or "?". This
isn't ideal, but in practice it's unlikely to cause much difficulty.

I would have been happy with that slightly unusual meaning had it been
documented. I don't think writing "[?], [*], escapes" really conveys
that.
 
Seebs

That could only happen if the user has a file named [x] (where x is
any character) and a file named "?" in the same directory, which is
extremely unlikely.

Extremely unlikely things happen several times a day.
By adding complex escape syntax you actually make errors more likely,
because the user might get the expression wrong.

I would rather have something which is possible to get wrong, but
also possible to get right, than something slightly harder to get
wrong, but absolutely impossible to get right.

-s
 
Malcolm McLean

To solve the problem fully you don't need any new escape syntax, though
you do have to generalise use of [...].  If you write either:
That's a good point.
 
Malcolm McLean

To solve the problem fully you don't need any new escape syntax, though
you do have to generalise use of [...].  If you write either:
I've generalised it to match any characters contained within the
brackets, and also allowed ranges.

The problem now is that if the user types [z-a] it's not clear what
the function should do, and [[]?] isn't very visually appealing as a
match for "[ character ]".

static int chmatch(const char *target, const char *pat);

/*
  wildcard matcher.
  Params: str - the target string
          pattern - pattern to match
  Returns: 1 if match, 0 if not.
  Notes: ? - match any character
         * - match zero or more characters
         [?], [*], escapes,
         [abc], match a, b or c.
         [A-Z] [0-9] [*-x], match range.
         [[] - match '['.
*/
int matchwild(const char *str, const char *pattern)
{
  const char *target = str;
  const char *pat = pattern;
  int gobble;

  while( (gobble = chmatch(target, pat)) )
  {
    target++;
    pat += gobble;
  }
  if(*target == 0 && *pat == 0)
    return 1;
  else if(*pat == '*')
  {
    while(pat[1] == '*')
      pat++;
    if(pat[1] == 0)
      return 1;
    while(*target)
      if(matchwild(target++, pat+1))
        return 1;
  }
  return 0;
}

/*
  match a character.
  Params: target - target string
          pat - pattern string.
  Returns: number of pat characters matched.
  Notes: means that a * in pat will return zero
*/
static int chmatch(const char *target, const char *pat)
{
  char *end, *ptr;

  if(*pat == '[' && end = strchr(pat, ']') )
  {
    /* allow [A-Z] and like syntax */
    if(end - pat == 4 && pat[2] == '-' && pat[1] <= pat[3])
      if(*target >= pat[1] && *target <= pat[3])
        return 5;
      else
        return 0;

    /* search for character list contained within brackets */
    ptr = strchr(pat+1, *target);
    if(ptr != 0 && ptr < end)
      return end - pat + 1;
    else
      return 0;
  }

  if(*pat == '?' && *target != 0)
    return 1;

  if(*pat == '*')
    return 0;

  if(*target == 0 || *pat == 0)
    return 0;

  if(*target == *pat)
    return 1;

  return 0;
}
 
B

Ben Bacarisse

Malcolm McLean said:
To solve the problem fully you don't need any new escape syntax, though
you do have to generalise use of [...].  If you write either:
I've generalised it to match any characters contained within the
brackets, and also allowed ranges.

I think you posted an old version. The posted code has a syntax error.

It's mildly unsatisfactory that one can't include ] in a match set since
it might be useful to match all brackets. The usual solution is to
treat a ] immediately following a [ as not being the end of a match set.
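
A small sketch of that convention, assuming a simple character list with no ranges. The helper names set_end and in_set are invented for this example.

```c
#include <string.h>

/* Given pat pointing at '[', returns a pointer to the ']' that closes the
   match set, or NULL if the set is unterminated. A ']' directly after the
   '[' is taken as a member of the set, not as its end. */
static const char *set_end(const char *pat)
{
    const char *start = pat + 1;

    if (*start == ']')
        start++;
    return strchr(start, ']');
}

/* Returns 1 if c is one of the characters listed between '[' and end. */
static int in_set(const char *pat, const char *end, char c)
{
    const char *p;

    for (p = pat + 1; p < end; p++)
        if (*p == c)
            return 1;
    return 0;
}
```

With this rule the set "[])]" contains both ']' and ')', while "[]" on its own is an unterminated set.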

<snip>
 
Malcolm McLean

Ben Bacarisse said:
I think you posted an old version.  The posted code has a syntax error.
I cut and pasted a few minutes after it compiled cleanly. I cut and
pasted back and again it compiled cleanly, and I can't see the syntax
error by eye.
Can you tell me what diagnostic your compiler produced?
It's mildly unsatisfactory that one can't include ] in a match set since
it might be useful to match all brackets.  The usual solution is to
treat a ] immediately following a [ as not being the end of a match set.
This is worth doing.
 
Ben Bacarisse

Malcolm McLean said:
I cut and pasted a few minutes after it compiled cleanly.

I think you should name and shame that compiler. Which one is it?

Malcolm McLean said:
I cut and pasted back and again it compiled cleanly, and I can't see the
syntax error by eye.
Can you tell me what diagnostic your compiler produced?

mm.c: In function 'chmatch':
mm.c:52: warning: incompatible implicit declaration of built-in function 'strchr'
mm.c:52: error: lvalue required as left operand of assignment

It's the second that is the syntax error (missing parentheses).

<snip>
 
Keith Thompson

Malcolm McLean said:
I cut and pasted a few minutes after it compiled cleanly. I cut and
pasted back and again it compiled cleanly, and I can't see the syntax
error by eye.
Can you tell me what diagnostic your compiler produced?

The problem is on this line:

if(*pat == '[' && end = strchr(pat, ']') )

"&&" binds more tightly than "=", so the LHS of the assignment is

*pat == '[' && end

which is not an lvalue.

I wonder why your compiler didn't complain about that. (Mine did,
but it took me a while to figure out why.)
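
For illustration, parenthesising the assignment makes the condition compile and do what was presumably intended. This is only a sketch: bracket_length is a made-up helper, and end is declared const char * because pat is const.

```c
#include <string.h>

/* Returns the length of a bracketed group at the start of pat, counting
   both brackets, or 0 if pat does not start with a complete [...] group. */
static int bracket_length(const char *pat)
{
    const char *end;

    /* The inner parentheses make the assignment a complete operand of &&,
       so its result is compared with NULL instead of && being applied
       first. */
    if (*pat == '[' && (end = strchr(pat, ']')) != NULL)
        return (int)(end - pat) + 1;
    return 0;
}
```

The explicit #include <string.h> also takes care of the implicit-declaration warning for strchr.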

[...]
 
Malcolm McLean

I think you should name and shame that compiler.  Which one is it?
tcc - tiny C compiler. It's a free and very nice minimal compiler
which I use because there are few installation hassles or Microsoft
death threats for using the compiler for open source projects. But
it's a one man project.
 
