Need help with regular expression to parse URLs


Neil

Firstly, the repeated group as written has no way to admit slashes
*between* pairs of path elements.

Yes! I see it now. Thank you.
Secondly, you get one matching group per occurrence of a capturing group
in the *pattern*, not per occurrence of the subpattern in the match. That
is, if the above pair group matches five times, you'll still only get a
single pair of captured groups (the last ones). That, I think, means
there's no way to use a regular expression to do what you want to do here.
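
To make the point about repeated groups concrete, here is a minimal sketch. The pattern and the class and method names are hypothetical, made up for illustration; the OP's actual pattern was not posted here. The pair group matches twice, but only the last pair can be read back from the Matcher.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Hypothetical pattern: a literal prefix followed by one or more "/x/y" pairs.
    static final Pattern PAIRS = Pattern.compile("prefix(?:/(\\w+)/(\\w+))+");

    /** Returns the two captured groups for a full match, or null if the path does not match. */
    static String[] lastPair(String path) {
        Matcher m = PAIRS.matcher(path);
        if (!m.matches()) {
            return null;
        }
        // groupCount() is fixed by the pattern (2 here), no matter how often the group repeated.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] pair = lastPair("prefix/a/b/c/d");
        System.out.println(pair[0] + " + " + pair[1]); // prints "c + d"; the pair a + b is gone
    }
}
```

The earlier pair (a + b) is overwritten on the second pass through the group, which is why split plus a loop ends up being the simpler route.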

I did not realize this was a limitation of the regex matching.

I will use split.

Thanks,
Neil
 

markspace

Roedy said:
Complicated regexes are such a bitch to debug. We need a tool that
shows you just how far it got.


I use a little regex tester that I wrote. It's a Jar file that pops up
a GUI that lets me test a regex against sample input. It's really handy
and faster than making a new project in the IDE, and running compiles
against a Java string.

I'll post it if you think it would be generally useful.
 

Roedy Green

Writing a loop to iterate over the elements of the chunks array in pairs
is a pain, but a very minor one.

You don't even have to. Split tosses out the '/'s for you. You just
have to choose a magic subscript to bypass the unwanted lead fields.

--
Roedy Green Canadian Mind Products
http://mindprod.com

"You can have quality software, or you can have pointer arithmetic; but you cannot have both at the same time."
~ Bertrand Meyer (born: 1950 age: 59) 1989, creator of design by contract and the Eiffel language.
 

Roedy Green

I did not realize this was a limitation of the regex matching.

Regexes have a number of limitations. They can't, for example ensure
() are balanced. When they run out of steam, try a parser.

See http://mindprod.com/jgloss/parser.html
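
As a rough illustration of the "try a parser" route for the balanced-parentheses case, a hand-written scan with a depth counter is enough. This is only a sketch; the class and method names are made up for the example, and it ignores things like quoted strings or escapes.

```java
class ParenCheck {
    /** Returns true if every '(' is closed by a later ')' and no ')' appears before its '('. */
    static boolean isBalanced(String s) {
        int depth = 0; // number of currently open parentheses
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // a ')' with nothing open
                }
            }
        }
        return depth == 0; // anything still open means unbalanced
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(a(b)c)")); // prints "true"
        System.out.println(isBalanced(")("));      // prints "false"
    }
}
```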
 

Roedy Green

If you can write a custom parser in two minutes,

The problem with regexes is I never feel confident they are fully
debugged. With all the greedy/reluctant stuff the expected behaviour
becomes a matter of experiment, rather than something you just read.

Except for very simple ones, I never feel fully confident they are
correct.
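
A small sketch of the greedy/reluctant difference, for anyone who wants to run the experiment. The input string and the helper name are invented for the example; the quantifier behaviour is what java.util.regex does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    /** Returns the first substring of input matched by regex, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: .+ takes as much as it can, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", input));  // prints "<a><b>"
        // Reluctant: .+? takes as little as it can, stopping at the first '>'.
        System.out.println(firstMatch("<.+?>", input)); // prints "<a>"
    }
}
```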
 

Roedy Green

That hurted my brain

I think it was an entry in an obfuscated coding contest.
 

Tom Anderson

You don't even have to. Split tosses out the '/'s for you. You just
have to choose a magic subscript to bypass the unwanted lead fields.

No, the OP wanted to process path elements in pairs. If he had:

prefix/a/b/c/d.html

Then he wanted to get two pairs

a + b
c + d

Split will give you an array {a, b, c, d}. You need to write something
like:

String[] elements = path.split("/");
// Step two at a time; stop before a trailing unpaired element.
for (int i = 0; i + 1 < elements.length; i += 2) {
    String first = elements[i];
    String second = elements[i + 1];
}

Except that as you say, you need to stick in a magic subscript to skip
over the boring bits at the start of the path.
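
Putting the split, the pairing loop and the skip offset together might look something like the sketch below. The class name, the method name and the skip parameter are all invented for illustration, and the skip value of 1 assumes the path has exactly one leading element to ignore.

```java
import java.util.ArrayList;
import java.util.List;

class PathPairs {
    /** Splits path on '/' and returns consecutive pairs, ignoring the first skip elements. */
    static List<String[]> pairs(String path, int skip) {
        String[] elements = path.split("/");
        List<String[]> result = new ArrayList<>();
        // Start after the unwanted lead fields; a trailing unpaired element is dropped.
        for (int i = skip; i + 1 < elements.length; i += 2) {
            result.add(new String[] { elements[i], elements[i + 1] });
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : pairs("prefix/a/b/c/d.html", 1)) {
            System.out.println(pair[0] + " + " + pair[1]);
        }
        // prints "a + b" then "c + d.html"
    }
}
```

Note that the extension stays attached to the last element, so it would need stripping separately. Also, a path that starts with '/' produces an empty first element from split, so the skip count would have to be one higher in that case.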

tom
 

Tom Anderson

Regexes have a number of limitations. They can't, for example ensure
() are balanced.

A popular myth!

Regular languages cannot balance parentheses. But regular expressions as
we know and use them outgrew being a regular language years ago - once you
have backreferences and other modern conveniences, you *can* do things
like balancing parens. Somehow.

tom
 

RedGrittyBrick

Roedy said:
I think it was an entry in an obfuscated coding contest.

I wouldn't be so sure.

http://www.perlmonks.org/?node_id=183830

"And then there's my URL matcher. A bit outdated, as it only matches
HTTP, FTP, News, NNTP, telnet, gopher, WAIS, mailto, file, prospero,
LDAP, z39.50, CID, MID, VEMMI, IMAP and NFS URLs. Many other URLs
schemes have seen the light the last 5 years. One of these days, I'll
update the regex...."

I suspect[1] the monster is *not* deliberately obfuscated. It's just
that the space of valid URLs is monstrously large and complex.

Abigail is the author of Perl's Regexp::Common module, amongst others.


[1] My brain hurted too much to be sure.
 

Lew

markspace wrote, quoted or indirectly quoted someone who said :
I wouldn't be so sure.

http://www.perlmonks.org/?node_id=183830

"And then there's my URL matcher. A bit outdated, as it only matches
HTTP, FTP, News, NNTP, telnet, gopher, WAIS, mailto, file, prospero,
LDAP, z39.50, CID, MID, VEMMI, IMAP and NFS URLs. Many other URLs
schemes have seen the light the last 5 years. One of these days, I'll
update the regex...."

I suspect[1] the monster is *not* deliberately obfuscated. It's just
that the space of valid URLs is monstrously large and complex.

Abigail is the author of Perl's Regexp::Common module, amongst others.


[1] My brain hurted too much to be sure.

Besides, obfuscating regex is like dampening water.
 

Tom Anderson

markspace wrote, quoted or indirectly quoted someone who said :
I wouldn't be so sure.

http://www.perlmonks.org/?node_id=183830

"And then there's my URL matcher. A bit outdated, as it only matches HTTP,
FTP, News, NNTP, telnet, gopher, WAIS, mailto, file, prospero, LDAP,
z39.50, CID, MID, VEMMI, IMAP and NFS URLs. Many other URLs schemes have
seen the light the last 5 years. One of these days, I'll update the
regex...."

I suspect[1] the monster is *not* deliberately obfuscated. It's just that
the space of valid URLs is monstrously large and complex.

That actually looks like a pretty straightforward regexp to me. It just
has loads of nested non-capturing groups, which are not easy on the eye.

[1] My brain hurted too much to be sure.

Besides, obfuscating regex is like dampening water.

Alternatively, pissing in an ocean of piss.

tom
 
