Need help with regular expression to parse URLs

Tom Anderson · Aug 10, 2009

http://web.archive.org/web/20070705044149/http://www.foad.org/~abigail/Perl/url3.regex

Nice. But since all those groups are non-capturing, completely bloody
useless!

tom

Neil · Aug 10, 2009

Firstly, the repeated group as written has no way to admit slashes
*between* pairs of path elements.

Yes! I see it now. Thank you.

Secondly, you get one matching group per occurrence of a capturing group
in the *pattern*, not per occurrence of the subpattern in the match. That
is, if the above pair group matches five times, you'll still only get a
single pair of captured groups (the last ones). That, I think, means
there's no way to use a regular expression to do what you want to do here.

I did not realize this was a limitation of the regex matching.

I will use split.

Thanks,
Neil
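
The point about repeated groups can be checked with a small sketch. The class name, method names and the sample path "/a/b/c/d" are made up for illustration and are not from the thread. The first method uses a repeated pair group and only sees the last pair; the second is an alternative to split (not what Neil settled on) that matches one pair per find() call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Repeated pair group: every iteration overwrites groups 1 and 2,
    // so only the captures from the final iteration survive.
    static String[] lastPairOnly(String path) {
        Matcher m = Pattern.compile("(?:/([^/]+)/([^/]+))+").matcher(path);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    // Alternative: a pattern for a single pair, applied repeatedly with find().
    static List<String[]> allPairs(String path) {
        List<String[]> pairs = new ArrayList<>();
        Matcher m = Pattern.compile("/([^/]+)/([^/]+)").matcher(path);
        while (m.find()) {
            pairs.add(new String[] { m.group(1), m.group(2) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        String path = "/a/b/c/d"; // hypothetical sample input
        String[] last = lastPairOnly(path);
        System.out.println("repeated group: " + last[0] + " + " + last[1]);
        for (String[] pair : allPairs(path)) {
            System.out.println("find() loop:    " + pair[0] + " + " + pair[1]);
        }
    }
}
```

So a single match cannot hand back every pair, but a loop around a one-pair pattern can, which is roughly the same amount of code as the split approach.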

markspace · Aug 11, 2009

Roedy said:
Complicated regexes are such a bitch to debug. We need a tool that
shows you just how far it got.

I use a little regex tester that I wrote. It's a JAR file that pops up
a GUI that lets me test a regex against sample input. It's really handy,
and faster than making a new project in the IDE and compiling and running
against a Java string.

I'll post it if you think it would be generally useful.
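
For readers without such a tool, a console stand-in is easy to sketch. This is not markspace's program; the class and method names are invented here, and it only reports the first match and its groups rather than showing how far a failed match got.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexProbe {
    // Describes the first match of regex in input: its span and each group.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        StringBuilder out = new StringBuilder();
        out.append("match [").append(m.start()).append(',').append(m.end())
           .append("): ").append(m.group());
        for (int g = 1; g <= m.groupCount(); g++) {
            out.append("\n  group ").append(g).append(": ").append(m.group(g));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java RegexProbe <regex> <input>");
            return;
        }
        System.out.println(describe(args[0], args[1]));
    }
}
```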

markspace · Aug 11, 2009

Stefan said:
http://web.archive.org/web/20070705044149/http://www.foad.org/~abigail/Perl/url3.regex

That hurted my brain.

Roedy Green · Aug 11, 2009

Writing a loop to iterate over the elements of the chunks array in pairs
is a pain, but a very minor one.

You don't even have to. Split tosses out the '/'s for you. You just
have to choose a magic subscript to bypass the unwanted lead fields.

--
Roedy Green Canadian Mind Products
http://mindprod.com

"You can have quality software, or you can have pointer arithmetic; but you cannot have both at the same time."
~ Bertrand Meyer (born: 1950 age: 59) 1989, creator of design by contract and the Eiffel language.

Roedy Green · Aug 11, 2009

I did not realize this was a limitation of the regex matching.

Regexes have a number of limitations. They can't, for example, ensure
() are balanced. When they run out of steam, try a parser.

See http://mindprod.com/jgloss/parser.html
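
As a rough illustration of how small such a parser can be, here is a sketch of a balanced-parentheses check using a depth counter. The class and method names are hypothetical, and it deliberately ignores quoting and other bracket types.

```java
class ParenChecker {
    // Returns true if every '(' is closed by a later ')' and no ')'
    // appears before its matching '('.
    static boolean isBalanced(String text) {
        int depth = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // closing paren with nothing open
                }
            }
        }
        return depth == 0; // anything still open means unbalanced
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(a(b)c)"));
        System.out.println(isBalanced("(a(b)c"));
        System.out.println(isBalanced(")("));
    }
}
```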

Roedy Green · Aug 11, 2009

If you can write a custom parser in two minutes,

The problem with regexes is I never feel confident they are fully
debugged. With all the greedy/reluctant stuff the expected behaviour
becomes a matter of experiment, rather than something you just read.

Except for very simple ones, I never feel fully confident they are
correct.

Roedy Green · Aug 11, 2009

That hurted my brain

I think it was an entry in an obfuscated coding contest.

Tom Anderson · Aug 11, 2009

You don't even have to. Split tosses out the '/'s for you. You just
have to choose a magic subscript to bypass the unwanted lead fields.

No, the OP wanted to process path elements in pairs. If he had:

prefix/a/b/c/d.html

Then he wanted to get two pairs

a + b
c + d

Split will give you an array {a, b, c, d}. You need to write something
like:

String[] elements = path.split("/");
// step through the elements two at a time; stop if no partner remains
for (int i = 0; i + 1 < elements.length; i += 2) {
    String first = elements[i];
    String second = elements[i + 1];
}

Except that as you say, you need to stick in a magic subscript to skip
over the boring bits at the start of the path.

tom
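
Putting the loop above together with the "magic subscript", a runnable sketch might look like this. The class name, the method name and the skip value of 1 are assumptions for the sample path; a real path may have a different number of leading fields, and a path starting with "/" yields an empty first element that also has to be skipped.

```java
import java.util.ArrayList;
import java.util.List;

class PathPairs {
    // Splits path on '/', ignores the first `skip` elements, and returns
    // the remaining elements grouped in pairs. A trailing unpaired
    // element is dropped.
    static List<String[]> pairs(String path, int skip) {
        String[] elements = path.split("/");
        List<String[]> result = new ArrayList<>();
        for (int i = skip; i + 1 < elements.length; i += 2) {
            result.add(new String[] { elements[i], elements[i + 1] });
        }
        return result;
    }

    public static void main(String[] args) {
        // skip = 1 steps over "prefix"; that count is an assumption.
        for (String[] pair : pairs("prefix/a/b/c/d.html", 1)) {
            System.out.println(pair[0] + " + " + pair[1]);
        }
    }
}
```

Note that the last element comes back as "d.html"; stripping the extension is a separate step that this sketch does not attempt.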

Tom Anderson · Aug 11, 2009

Regexes have a number of limitations. They can't, for example, ensure
() are balanced.

A popular myth!

Regular languages cannot balance parentheses. But regular expressions as
we know and use them outgrew being a regular language years ago - once you
have backreferences, recursive patterns and other modern conveniences, you
*can* do things like balancing parens, at least in engines such as Perl's
and PCRE that support recursion.

tom

RedGrittyBrick · Aug 12, 2009

Roedy said:
I think it was an entry in an obfuscated coding contest.

I wouldn't be so sure.

http://www.perlmonks.org/?node_id=183830

"And then there's my URL matcher. A bit outdated, as it only matches
HTTP, FTP, News, NNTP, telnet, gopher, WAIS, mailto, file, prospero,
LDAP, z39.50, CID, MID, VEMMI, IMAP and NFS URLs. Many other URLs
schemes have seen the light the last 5 years. One of these days, I'll
update the regex...."

I suspect[1] the monster is *not* deliberately obfuscated. It's just
that the space of valid URLs is monstrously large and complex.

Abigail is the author of Perl's Regexp::Common module, amongst others.

[1] My brain hurted too much to be sure.

Lew · Aug 12, 2009

RedGrittyBrick said:
I suspect[1] the monster is *not* deliberately obfuscated. It's just
that the space of valid URLs is monstrously large and complex.

[1] My brain hurted too much to be sure.

Besides, obfuscating regex is like dampening water.

Tom Anderson · Aug 13, 2009

RedGrittyBrick said:
I suspect[1] the monster is *not* deliberately obfuscated. It's just that
the space of valid URLs is monstrously large and complex.

That actually looks like a pretty straightforward regexp to me. It just
has loads of nested non-capturing groups, which are not easy on the eye.

[1] My brain hurted too much to be sure.

Besides, obfuscating regex is like dampening water.

Alternatively, pissing in an ocean of piss.

tom
