how greedy is nongreedy in regexp ?

peter pilsl · Nov 8, 2004

The following substitution does not do what I want, so I ask myself where
the knot in my brain is this time.

I want to clean up URLs and turn a/b/c/../e/f into a/b/e/f

$a="a/b/c/../e/f";
$a=~s,/(.*?)/\.\./,/,;
print $1,"\n",$a,"\n"

which gives:

b/c
a/e/f

That's still too greedy for me.

My desired match would be "c".
Is there a way to achieve this? Maybe some look-behind hack? I'm not
familiar with those.

My alternative approach would be to simply split the whole thing on "/"
and iterate through it. This approach, however, seems to be more
expensive than I'd like it to be.

thnx,
peter

Anno Siegel · Nov 8, 2004

peter pilsl said:
The following substitution does not do what I want, so I ask myself where
the knot in my brain is this time.

I want to clean up URLs and turn a/b/c/../e/f into a/b/e/f

The standard module File::Spec has canonpath() to do this. Since
canonpath() does a logical cleanup without looking at a physical
file system, it should work for URLs.

Anno

Darren Dunham · Nov 8, 2004

The standard module File::Spec has canonpath() to do this. Since
canonpath() does a logical cleanup without looking at a physical
file system, it should work for URLs.

Note that this is not a valid cleanup in the presence of symbolic links.
So when run on Unix (or Mac or OS2), File::Spec->canonpath() will not
convert the string.

$ perl -MFile::Spec -le 'print File::Spec->canonpath("a/b/c/../d/e");'
a/b/c/../d/e

Because Win32 filesystems don't support symlinks, you can explicitly run
the win32 module to do this...

$ perl -MFile::Spec::Win32 -le 'print File::Spec::Win32->canonpath("a/b/c/../d/e");'
a\b\d\e

Whether you can cope with the backslashes is another question....

Brian McCauley · Nov 10, 2004

Anno said:
The standard module File::Spec has canonpath() to do this.

But since this is a question about canonicalising URLs it would seem
more appropriate to use the URI module. However the canonical() method
of URI doesn't do this. Is this right or should it be considered a bug
in URI?

Brian McCauley · Nov 10, 2004

peter said:
Subject: Re: how greedy is nongreedy in regexp ?

[ snip ]

Others have answered the question in the message body, but to answer the
question in the subject line...

Non-greediness is local. Non-greediness does _not_ trump finding the
leftmost match. When there are two (non-)greedy subexpressions, the first
one's (non-)greediness takes priority.
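
The point about the leftmost match can be checked against the substitution from the original question. The following is only a minimal sketch: the variable names and the [^/]+ alternative are illustrative assumptions, not something posted in the thread.

```perl
use strict;
use warnings;

my $path = "a/b/c/../e/f";

# Original pattern: the match has to start at the leftmost "/" (the one
# after "a"), so the non-greedy .*? grows until "/../" follows it.
(my $lazy = $path) =~ s{/(.*?)/\.\./}{/};
my $lazy_capture = $1;
print "lazy:    captured '$lazy_capture', result '$lazy'\n";
# lazy:    captured 'b/c', result 'a/e/f'

# [^/]+ cannot cross a slash, so the attempt starting at the first "/"
# fails outright and the engine moves on to the next "/".
(my $fixed = $path) =~ s{/([^/]+)/\.\./}{/};
my $fixed_capture = $1;
print "segment: captured '$fixed_capture', result '$fixed'\n";
# segment: captured 'c', result 'a/b/e/f'
```

This handles a single "/x/../" only. It would also treat a ".." segment as an ordinary name (for example in a/../../b) and needs a slash in front of the segment, so it is not a general cleanup.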

Stuart Moore · Nov 11, 2004

Brian said:
But since this is a question about canonicalising URLs it would seem
more appropriate to use the URI module. However the canonical() method
of URI doesn't do this. Is this right or should it be considered a bug
in URI?

Is www.foo.com/a/b/../c/d always the same as www.foo.com/a/c/d ?

I'd guess on most webservers it is, but surely you can't guarantee it in
general. Suppose the dots were instead input to a script?

Stuart

Brian McCauley · Nov 11, 2004

Stuart said:
Is www.foo.com/a/b/../c/d always the same as www.foo.com/a/c/d ?

I'd guess on most webservers it is, but surely you can't guarantee it in
general.

I think you can guarantee it on standards-conforming web servers. At
least that's the implication of the W3C standards for URIs.

Suppose the dots were instead input to a script?

Then at least one of them must be encoded.

So I think it's fairly clear from the standards that

http://foo/a/b/c/../../d/./e/f

Should canonicalize to

http://foo/a/d/e/f

It also follows that both of...

http://foo/a/b/%2E%2E/c
http://foo/a/b/%2e%2e/c

...should canonicalise to the same thing, and it shouldn't be...

http://foo/a/b/../c

...because that's a non-canonical form of...

http://foo/a/c

It is, however, unclear to me what the correct canonical form would be. I
think it should probably be

http://foo/a/b/%2E%2E/c

Joe Schaefer · Nov 11, 2004

Brian McCauley said:
But since this is a question about canonicalising URLs it would seem more
appropriate to use the URI module. However the canonical() method of URI
doesn't do this. Is this right or should it be considered a bug in URI?

I think it mostly depends on whether the URI is relative or absolute.
According to RFC 2396, Section 4:

URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]

The syntax for relative URI is a shortened form of that for absolute
URI, where some prefix of the URI is missing and certain path
components ("." and "..") have a special meaning when, and only when,
interpreting a relative path. The relative URI syntax is defined in
Section 5.

Section 5.2, step 6, talks about the semantics of "../", but only in the
narrow context of resolving a relative URI into an absolute one.
AFAICT, RFC 2396 discusses "canonicalization" of an absolute URI only as
it relates to %XX escapes.

Ben Morrow · Nov 11, 2004

Brian McCauley said:
But since this is a question about canonicalising URLs it would seem
more appropriate to use the URI module. However the canonical() method
of URI doesn't do this. Is this right or should it be considered a bug
in URI?

This is right. The only time .. and . are significant in a URI is when
they are on the left-hand end of a relative URI which is being made
absolute, when they are specified to mean what you'd expect.

I would consider this a bug in the URI spec, myself; as it means that,
say,

http://foo/a/./b

is a valid URI distinct from

http://foo/a/b

; but neither can be made relative to

http://foo/a

without losing the distinction. Ach, well.

Ben

Brian McCauley · Nov 13, 2004

Ben said:
This is right. The only time .. and . are significant in a URI is when
they are on the left-hand end of a relative URI which is being made
absolute, when they are specified to mean what you'd expect.

I would consider this a bug in the URI spec, myself; as it means that,
say,

http://foo/a/./b

is a valid URI distinct from

http://foo/a/b

That's what the RFC says but I think I found another (later?) document
on W3C that says that /./ or /../ are invalid in absolute URLs and must
be written using %2E.

Brian McCauley · Nov 13, 2004

Brian said:
That's what the RFC says but I think I found another (later?) document
on W3C that says that /./ or /../ are invalid in absolute URLs and must
be written using %2E.

Found it. It's still a draft but it's also widespread current practice.

http://www.gbiv.com/protocols/uri/rev-2002/draft-fielding-uri-rfc2396bis-07.html#path

and

http://www.gbiv.com/protocols/uri/r...-uri-rfc2396bis-07.html#relative-dot-segments

"[...] removing the special "." and ".." complete path
segments from referenced path. This is done [...] whether
or not the path was relative, [...]"

Discussion leading to this change can be found here:

http://www.gbiv.com/protocols/uri/rev-2002/issues.html#033-dot-segments
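
As a rough illustration of the semantics quoted from the draft, here is a minimal sketch that strips complete "." and ".." segments from the path of an absolute URL. The sub names remove_dot_segments and canonicalize_url are hypothetical, the URL is split with a simple regex rather than the URI module, and the segment stack is a simplification rather than the draft's exact algorithm.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse "." and ".." segments in an absolute path.
sub remove_dot_segments {
    my ($path) = @_;
    my @out;
    # Limit -1 keeps trailing empty fields, so a trailing slash survives.
    my @segments = split m{/}, $path, -1;
    shift @segments;    # drop the empty field in front of the leading "/"
    while (@segments) {
        my $seg = shift @segments;
        if ($seg eq '.' or $seg eq '..') {
            pop @out if $seg eq '..';        # ".." drops the previous segment, if any
            push @out, '' unless @segments;  # a final dot segment leaves a trailing "/"
        } else {
            push @out, $seg;
        }
    }
    return '/' . join '/', @out;
}

# Hypothetical wrapper: split off scheme://authority and query/fragment,
# and clean only the path in between.
sub canonicalize_url {
    my ($url) = @_;
    my ($prefix, $path, $rest) =
        $url =~ m{^([a-z][a-z0-9+.\-]*://[^/?#]*)([^?#]*)(.*)$}i
        or return $url;    # leave anything unrecognised untouched
    return $prefix . remove_dot_segments($path) . $rest;
}

print canonicalize_url("http://foo/a/b/c/../../d/./e/f"), "\n";
# http://foo/a/d/e/f
```

Because the comparison is done on the raw segment text, percent-encoded dots such as %2E%2E are left alone, which matches the point made earlier in the thread that dots meant as data have to be encoded.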

Brian McCauley · Nov 13, 2004

Joe said:
I think it mostly depends on whether or not the URI is relative
or absolute. According to rfc 2396

Yes, I was getting ahead of myself. You are, of course, right about
RFC 2396; I was looking at the draft of its successor.

I think it would be helpful if the URI module had an option to implement
the new semantics since they are already widespread in other software.

Maybe I'll submit a patch.
