filemask to regex

Discussion in 'Perl Misc' started by George Mpouras, Aug 24, 2013.

  1. I want to convert an OS filemask with possible wildcards to a regex.
    What do you think of the following approach?

    $mask = "??-media*.wm?";
    $mask=~s|\*\.\*|\*|g; # *.* -> *
    $mask=~s|\.|\\.|g; # . -> \.
    $mask=~s|\?|.|g; # ? -> .
    $mask=~s|\*|.*?|g; # * -> .*?
    $mask=~s/(\(|\)|\+|\^|\[|\]|\{|\}|\$|\@|\%)/\\$1/g; #escape ()+^[]{}$@%
    $mask = qr/^$mask$/i;
    George Mpouras, Aug 24, 2013
    #1

  2. George Mpouras <> writes:
    > I want to convert an OS filemask with possible wildcards to a regex.
    > What do you think of the following approach?
    >
    > $mask = "??-media*.wm?";
    > $mask=~s|\*\.\*|\*|g; # *.* -> *


    [...]

    > $mask=~s|\*|.*?|g; # * -> .*?


    This sequence of conversions is wrong because it will translate *.* to
    .*?, i.e., something which matches a string with no . in it.
    Rainer Weikusat, Aug 25, 2013
    #2
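To make the objection concrete, here is a minimal sketch that applies the substitution steps from post #1 and returns the intermediate string before it is compiled. The sub name `translate_mask_steps` is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper: the substitution steps from post #1, unchanged,
# returning the translated string instead of a compiled regex.
sub translate_mask_steps {
    my ($mask) = @_;
    $mask =~ s|\*\.\*|\*|g;   # *.* -> *
    $mask =~ s|\.|\\.|g;      # .   -> \.
    $mask =~ s|\?|.|g;        # ?   -> .
    $mask =~ s|\*|.*?|g;      # *   -> .*?
    $mask =~ s/(\(|\)|\+|\^|\[|\]|\{|\}|\$|\@|\%)/\\$1/g;   # escape ()+^[]{}$@%
    return $mask;
}

# The literal dot of *.* is gone after the first step.
print translate_mask_steps('*.*'), "\n";
print translate_mask_steps('??-media*.wm?'), "\n";
```

Because the first step folds `*.*` into `*`, the dot never reaches the escaping step, so the resulting pattern also accepts names that contain no dot at all.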

  3. George Mpouras <> writes:
    > I want to convert an OS filemask with possible wildcards to a regex.
    > What do you think of the following approach?
    >
    > $mask = "??-media*.wm?";
    > $mask=~s|\*\.\*|\*|g; # *.* -> *
    > $mask=~s|\.|\\.|g; # . -> \.
    > $mask=~s|\?|.|g; # ? -> .
    > $mask=~s|\*|.*?|g; # * -> .*?
    > $mask=~s/(\(|\)|\+|\^|\[|\]|\{|\}|\$|\@|\%)/\\$1/g; #escape ()+^[]{}$@%
    > $mask = qr/^$mask$/i;


    I think I would again prefer to do a part-by-part lexical analysis of
    the input, mainly because this means that quotemeta can be used to
    quote metacharacters in the 'text' parts:

    ------------------
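    sub xlate_tin_pattern
    {
        my $out = '';

        # Walk the pattern token by token; \G resumes where the last match ended.
        for ($_[0]) {
            # A run of ? becomes one . per character.
            /\G(\?+)/gc && do {
                $out .= '.' x length($1);
                redo;
            };

            # A run of * collapses to a single non-greedy .*?
            /\G\*+/gc && do {
                $out .= '.*?';
                redo;
            };

            # Literal text is escaped. There is no /c here, so a failed match
            # at the end of the string resets pos() and the loop finishes.
            /\G([^?*]+)/g && do {
                $out .= quotemeta($1);
                redo;
            };
        }

        return $out;
    }

    print(xlate_tin_pattern($_), "\n") for @ARGV;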
    Rainer Weikusat, Aug 25, 2013
    #3
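The sub above returns an unanchored string, so it still has to be compiled before it can filter names. A minimal sketch of that last step follows, assuming a hypothetical helper `mask_to_qr` that uses the same token-by-token idea, written as a while loop, and anchors the result with `\A` and `\z`. The sample file names are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: tokenizes the mask like xlate_tin_pattern does
# and returns an anchored, compiled regex.
sub mask_to_qr {
    my ($mask) = @_;
    my $out = '';
    pos($mask) = 0;
    while (pos($mask) < length $mask) {
        if    ($mask =~ /\G(\?+)/gc)    { $out .= '.' x length $1 }   # ?... -> one . each
        elsif ($mask =~ /\G\*+/gc)      { $out .= '.*?' }             # *... -> .*?
        elsif ($mask =~ /\G([^?*]+)/gc) { $out .= quotemeta $1 }      # literal text
    }
    return qr/\A$out\z/;
}

my $re = mask_to_qr('??-media*.wm?');
print "$_\n" for grep { $_ =~ $re } qw(01-media-a.wmv 01-media.txt xx-mediafoo.wma);
```

Since `quotemeta` handles every literal run, a mask such as `a+b.*` needs no hand-maintained list of characters to escape.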
  4. On 25/8/2013 7:52 PM, Rainer Weikusat wrote:
    > George Mpouras <> writes:
    >> I want to convert an OS filemask with possible wildcards to a regex.
    >> What do you think of the following approach?
    >>
    >> $mask = "??-media*.wm?";
    >> $mask=~s|\*\.\*|\*|g; # *.* -> *
    >> $mask=~s|\.|\\.|g; # . -> \.
    >> $mask=~s|\?|.|g; # ? -> .
    >> $mask=~s|\*|.*?|g; # * -> .*?
    >> $mask=~s/(\(|\)|\+|\^|\[|\]|\{|\}|\$|\@|\%)/\\$1/g; #escape ()+^[]{}$@%
    >> $mask = qr/^$mask$/i;

    >
    > I think I would again prefer to do a part-by-part lexical analysis of
    > the input, mainly because this means that quotemeta can be used to
    > quote metacharacters in the 'text' parts:
    >
    > ------------------
    > sub xlate_tin_pattern
    > {
    > my $out;
    >
    > for ($_[0]) {
    > /\G(\?+)/gc && do {
    > $out .= '.' x length($1);
    > redo;
    > };
    >
    > /\G\*+/gc && do {
    > $out .= '.*?';
    > redo;
    > };
    >
    > /\G([^?*]+)/g && do {
    > $out .= quotemeta($1);
    > redo;
    > };
    > }
    >
    > return $out;
    > }
    >
    > print(xlate_tin_pattern($_), "\n") for @ARGV;
    >



    Very good!
    But the line
    /\G(\?+)/gc && do { $out .= '.' x length($1); redo };
    fries my brain.
    So I think I will stick with the equivalent f1():

    print xlate_tin_pattern('@s??im..pl%e.???a'), "\n";
    print f1('@s??im..pl%e.???a'), "\n";


    sub xlate_tin_pattern
    {
        my $out = '';
        for ($_[0]) {
            /\G(\?+)/gc   && do { $out .= '.' x length($1); redo };
            /\G\*+/gc     && do { $out .= '.*?'; redo };
            /\G([^?*]+)/g && do { $out .= quotemeta($1); redo };
        }
        $out
    }


    sub f1
    {
        my $out = $_[0];              # lexical copy of the mask
        $out =~ s/([^?*]+)/\Q$1\E/g;  # escape the literal runs first
        $out =~ s|\?|.|g;             # ? -> .
        $out =~ s|\*+|.*?|g;          # a run of * -> .*?
        $out
    }
    George Mpouras, Aug 25, 2013
    #4
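A quick way to check the claimed equivalence is to run both subs over a handful of masks and compare the strings. The sketch below copies the two subs from the post (with a lexical `$out` in `f1`); the list of sample masks is hypothetical and only meant to mix wildcards with regex metacharacters.

```perl
use strict;
use warnings;

sub xlate_tin_pattern {
    my $out = '';
    for ($_[0]) {
        /\G(\?+)/gc   && do { $out .= '.' x length($1); redo };
        /\G\*+/gc     && do { $out .= '.*?'; redo };
        /\G([^?*]+)/g && do { $out .= quotemeta($1); redo };
    }
    return $out;
}

sub f1 {
    my $out = $_[0];
    $out =~ s/([^?*]+)/\Q$1\E/g;   # escape literal runs
    $out =~ s|\?|.|g;              # ? -> .
    $out =~ s|\*+|.*?|g;           # run of * -> .*?
    return $out;
}

# Hypothetical sample masks.
my @masks = ('@s??im..pl%e.???a', '??-media*.wm?', 'a**b', '*.*', 'file(1).txt');

for my $mask (@masks) {
    my ($lexed, $subst) = (xlate_tin_pattern($mask), f1($mask));
    printf "%-20s %s\n", $mask,
        ($lexed eq $subst ? "same: $lexed" : "DIFFER: $lexed vs $subst");
}
```

The order of the last two substitutions in `f1` matters: replacing `?` before `*` keeps the `?` of the freshly inserted `.*?` from being rewritten.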
  5. Rainer Weikusat <> writes:

    > George Mpouras <> writes:
    >> I want to convert an OS filemask with possible wildcards to a regex.
    >> What do you think of the following approach?
    >>
    >> $mask = "??-media*.wm?";
    >> $mask=~s|\*\.\*|\*|g; # *.* -> *

    >
    > [...]
    >
    >> $mask=~s|\*|.*?|g; # * -> .*?

    >
    > This sequence of conversions is wrong because it will translate *.* to
    > .*?, i.e., something which matches a string with no . in it.


    I think that may be deliberate. I was going to ask "what OS?", but when
    I saw that, I remembered that in MS-DOS (and maybe others), *.* means
    all files. Similarly X*.* means all files beginning with X. (The reason
    being that the . is not in the file name, just in the presentation of
    it, though I still think that's a weak argument.)

    I'm not saying the translation is correct -- I can't remember all of
    MS-DOS's rules, and it's likely to be wrong if the target is not an
    MS-DOS-like OS.

    --
    Ben.
    Ben Bacarisse, Aug 25, 2013
    #5
  6. Ben Bacarisse <> writes:
    > Rainer Weikusat <> writes:
    >> George Mpouras <> writes:
    >>> I want to convert an OS filemask with possible wildcards to a regex.
    >>> What do you think of the following approach?
    >>>
    >>> $mask = "??-media*.wm?";
    >>> $mask=~s|\*\.\*|\*|g; # *.* -> *

    >>
    >> [...]
    >>
    >>> $mask=~s|\*|.*?|g; # * -> .*?

    >>
    >> This sequence of conversions is wrong because it will translate *.* to
    >> .*?, i.e., something which matches a string with no . in it.

    >
    > I think that may be deliberate. I was going to ask "what OS?", but when
    > I saw that, I remembered that in MS-DOS (and maybe others), *.* means
    > all files. Similarly X*.* means all files beginning with X. (The reason
    > being that the . is not in the file name, just in the presentation of
    > it, though I still think that's a weak argument.)


    'DOS filenames' (and very likely VMS filenames as well) are not plain
    strings but consist of two components, a 'name' part and a 'type'
    part, and because of this, *.* means 'all names and all types', i.e.
    'every file'. If the input these patterns are supposed to be matched
    against is really a list of 'DOS filenames', translating *.* to .*\..*
    (or .+\..+) instead of .* (or .+) will make no difference because the
    extension is always going to be there. But when it is just a list of
    strings, making '*.*' match both abc and abc.def is IMHO
    counterintuitive. It also precludes some possibly useful applications
    such as 'match everything which has an extension'.
    Rainer Weikusat, Aug 26, 2013
    #6
  7. On 26/8/2013 1:50 AM, Ben Bacarisse wrote:
    >
    > I'm not saying the translation is correct -- I can't remember all of
    > MS-DOS's rules, and it's likely to be wrong if the target is not an
    > MS-DOS-like OS.
    >


    Yes, you are correct, it was intended: on Windows, *.* means * !
    But Rainer's approach in his other answer is very clever and correct.
    George Mpouras, Aug 26, 2013
    #7
  8. If we set Windows aside, bash also has the interesting range
    operator!

    ls -l somefile{01,02,03,07}
    ls -l somefile{01..05}
    George Mpouras, Aug 26, 2013
    #8
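For comparison on the Perl side, here is a small sketch. Perl's built-in `glob` expands comma lists in braces much like the first bash command above, and it returns the names even when no such files exist because the pattern contains no `*`, `?` or `[`. As far as I know `glob` has no `{01..05}` range form, so the range below is built with `sprintf` instead; that substitution is my assumption about how one might mirror the second command, not something from the thread.

```perl
use strict;
use warnings;

# Brace lists are expanded by glob itself; no files need to exist for this.
my @listed = glob('somefile{01,02,03,07}');

# No {01..05} form in glob, so build the zero-padded range by hand.
my @ranged = map { sprintf 'somefile%02d', $_ } 1 .. 5;

print "$_\n" for @listed, @ranged;
```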
  9. Rainer Weikusat <> writes:

    > Ben Bacarisse <> writes:
    >> Rainer Weikusat <> writes:
    >>> George Mpouras <> writes:
    >>>> I want to convert an OS filemask with possible wildcards to a regex.
    >>>> What do you think of the following approach?
    >>>>
    >>>> $mask = "??-media*.wm?";
    >>>> $mask=~s|\*\.\*|\*|g; # *.* -> *
    >>>
    >>> [...]
    >>>
    >>>> $mask=~s|\*|.*?|g; # * -> .*?
    >>>
    >>> This sequence of conversions is wrong because it will translate *.* to
    >>> .*?, i.e., something which matches a string with no . in it.

    >>
    >> I think that may be deliberate. I was going to ask "what OS?", but when
    >> I saw that, I remembered that in MS-DOS (and maybe others), *.* means
    >> all files. Similarly X*.* means all files beginning with X. (The reason
    >> being that the . is not in the file name, just in the presentation of
    >> it, though I still think that's a weak argument.)

    >
    > 'DOS filenames' (and very likely VMS filenames as well) are not plain
    > strings but consist of two components, a 'name' part and a 'type'
    > part, and because of this, *.* means 'all names and all types', ie
    > 'every file'. If the input these patterns are supposed to be matched
    > against is really a list of 'DOS filenames', translating *.* to .*\..*
    > (or .+\..+) instead of .* (or .+) will make no difference because the
    > extension is always going to be there.


    I don't follow. If I get a list of DOS file names using, say, DIR,
    those with no extension have no dot. .*\..* won't match them but .*
    will. You can write a file with no extension as "XYZ." as well as "XYZ"
    but, IIRC, many programs dropped the '.' if there was no extension.

    > But when it was just a list of
    > strings, making '*.*' match both abc and abc.def is IMHO
    > counterintuitive. It also precludes some possibly useful applications
    > such as 'match everything which has an extension'.


    I must be missing your point because I don't follow this either. The
    DOS way to match names with an extension was to write *.?*, and the
    suggested translation will work for that.

    --
    Ben.
    Ben Bacarisse, Aug 26, 2013
    #9
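The DOS-style behaviour discussed here can be tried out with a short sketch. It wraps the substitution steps from post #1 in a hypothetical sub, `dos_mask_to_regex`, and tests `*.*` and `*.?*` against a few made-up names; whether this matches every real MS-DOS rule is not something the thread settles.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the steps from post #1 (DOS-style: *.* means *).
sub dos_mask_to_regex {
    my ($mask) = @_;
    $mask =~ s|\*\.\*|\*|g;   # *.* -> *
    $mask =~ s|\.|\\.|g;      # .   -> \.
    $mask =~ s|\?|.|g;        # ?   -> .
    $mask =~ s|\*|.*?|g;      # *   -> .*?
    $mask =~ s/(\(|\)|\+|\^|\[|\]|\{|\}|\$|\@|\%)/\\$1/g;   # escape ()+^[]{}$@%
    return qr/^$mask$/i;
}

for my $mask ('*.*', 'X*.*', '*.?*') {
    my $re = dos_mask_to_regex($mask);
    for my $name ('abc', 'abc.def', 'Xfile') {
        printf "%-5s %-8s %s\n", $mask, $name,
            ($name =~ $re ? 'matches' : 'does not match');
    }
}
```

With this translation `*.?*` requires a dot followed by at least one character, which is the "has an extension" test Ben describes, while `*.*` accepts every name.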
  10. On 24/08/2013 16:03, George Mpouras wrote:

    > I want to convert an OS filemask with possible wildcards to a regex.
    > What do you think of the following approach?
    >
    > $mask = "??-media*.wm?";
    > $mask=~s|\*\.\*|\*|g; # *.* -> *
    > $mask=~s|\.|\\.|g; # . -> \.
    > $mask=~s|\?|.|g; # ? -> .
    > $mask=~s|\*|.*?|g; # * -> .*?
    > $mask=~s/(\(|\)|\+|\^|\[|\]|\{|\}|\$|\@|\%)/\\$1/g; #escape ()+^[]{}$@%
    > $mask = qr/^$mask$/i;


    Also check out `perldoc -f glob`.

    --
    Ruud
    Dr.Ruud, Aug 26, 2013
    #10
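When the names to be matched are real files, the mask can be handed to the filesystem directly instead of being translated. The sketch below is one way to try that with `bsd_glob` from the core File::Glob module; the temporary directory and the file names are made up for the illustration. Note that, unlike the `/i` regex in post #1, this matching is case sensitive on typical Unix filesystems.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Glob qw(bsd_glob);

# Create a scratch directory with a few hypothetical files.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(01-media-a.wmv 02-media-b.wma notes.txt)) {
    open my $fh, '>', "$dir/$name" or die "cannot create $name: $!";
    close $fh;
}

# bsd_glob treats the whole string as one pattern, even if $dir has spaces.
my @found = bsd_glob("$dir/??-media*.wm?");
print "$_\n" for @found;
```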
  11. Rainer Weikusat <> writes:

    [...]

    > I was under the not entirely correct expression that


    [...]

    Personal remark which seems appropriate here: Except when tightly
    supervised, my mind has a tendency to invert things, as can be seen
    here where I thought 'under the impression' and thus - without
    noticing that - typed 'under the expression'. It's been a while since
    I accidentally implemented an algorithm doing the exact opposite of
    what it was supposed to do, but these days, I spend more time thinking
    about the code I'm planning to write and less typing away and hashing
    stuff out as the need arises which is probably the reason for
    that. But it still happens fairly often in 'ordinary text'.
    Rainer Weikusat, Aug 27, 2013
    #11
