Problem with regular expression for matching the url endings

erenay

Hi, I have written a regular expression to pick out the URLs that
interest me from an access log file.
I want to select addresses that start with "http://" and end with
".html", ".htm", ".asp", ".php", ".aspx" or with a number.
The following pattern seems to accept only URLs ending with ".html" or
".htm".
Does anybody have an idea why it doesn't recognize URLs with other
endings?


The pattern I use is:
Pattern htmHtml = Pattern.compile("^(http://)\\S+((\\.htm) | (\\.html)
| (\\.asp) | (\\.php)| (\\.aspx) | / | (\\d))$");

It doesn't recognise the following URLs:

http://www.galatasaray.org/Futbol/GS/anket/anket.asp
http://bimonline.insites.be/common/CookieCheck.asp?siteID=2382&TagId=1&Pad=tr&Lang=tr&Country=tr&b=1
http://www.aksiyon.com.tr/sonsayi210.php

It's possible that the problem is somewhere else in the code but I
wondered if you see something wrong in my pattern.

Regards,
Eren Aykin
 
Oliver Wong

erenay said:
Hi, I have written a regular expression to pick out the URLs that
interest me from an access log file.
I want to select addresses that start with "http://" and end with
".html", ".htm", ".asp", ".php", ".aspx" or with a number.
The following pattern seems to accept only URLs ending with ".html" or
".htm".
Does anybody have an idea why it doesn't recognize URLs with other
endings?


The pattern I use is:
Pattern htmHtml = Pattern.compile("^(http://)\\S+((\\.htm) | (\\.html)
| (\\.asp) | (\\.php)| (\\.aspx) | / | (\\d))$");

It doesn't recognise the following URLs:

http://www.galatasaray.org/Futbol/GS/anket/anket.asp
http://bimonline.insites.be/common/CookieCheck.asp?siteID=2382&TagId=1&Pad=tr&Lang=tr&Country=tr&b=1
http://www.aksiyon.com.tr/sonsayi210.php

Are you sure it accepts those that end with ".html"?

Could it have something to do with all those whitespaces in the pattern?

- Oliver
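
To illustrate the whitespace point: in Java's regex syntax a space inside a pattern is a literal character that must appear in the input, unless the Pattern.COMMENTS flag is set. The following is a minimal sketch, assuming a shortened two-extension variant of the pattern above; the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

class WhitespaceInPattern {
    // Shortened, hypothetical variant of the posted pattern: spaces around '|'.
    static final String SPACED = "^http://\\S+(\\.htm | \\.php)$";
    // The same alternation with the spaces removed.
    static final String TIGHT = "^http://\\S+(\\.htm|\\.php)$";

    // Whole-string match of input against regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String url = "http://www.aksiyon.com.tr/sonsayi210.php";
        // The second alternative is " \.php" (leading space), so this fails.
        System.out.println(matches(SPACED, 0, url));                // false
        System.out.println(matches(TIGHT, 0, url));                 // true
        // COMMENTS mode tells the engine to ignore whitespace in the pattern.
        System.out.println(matches(SPACED, Pattern.COMMENTS, url)); // true
    }
}
```

Removing the spaces is the simpler fix; Pattern.COMMENTS is mainly useful when the spacing is wanted for readability.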
 
Oliver Wong

erenay said:
Are you sure it accepts those that end with ".html"?
Could it have something to do with all those whitespaces in the
pattern?

You were right, Oliver: the previous pattern matched only URLs ending in ".htm".

I tried the pattern:
Pattern.compile("^(http://)\\S+[(\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|/|(\\d+)]$");
And it doesn't match any URLs.
How should I do it?

I'm not familiar with Java's particular variant of regular expressions,
but maybe the new problem is your addition of the square brackets. Did you
try the expression nkalagarla gave you?

<quote>
Try this.

Pattern.compile("^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");
</quote>

- Oliver
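
As a sketch of what the square brackets change: in java.util.regex, [...] is a character class that matches exactly one character from the listed set, and parentheses and '|' inside it are ordinary characters, not grouping or alternation. The example below uses a shortened version of the two patterns; the class name and the example.com URL are hypothetical.

```java
import java.util.regex.Pattern;

class BracketsVersusGroup {
    // Square brackets: ONE character out of the set ( . h t m ) | p
    static final Pattern CHAR_CLASS =
            Pattern.compile("^http://\\S+[(\\.htm)|(\\.php)]$");
    // Parentheses: one of the two complete suffixes.
    static final Pattern GROUP =
            Pattern.compile("^http://\\S+((\\.htm)|(\\.php))$");

    public static void main(String[] args) {
        String zip = "http://example.com/archive.zip"; // hypothetical URL
        String php = "http://www.aksiyon.com.tr/sonsayi210.php";

        // The class only checks that the last character is in the set,
        // so a URL ending in 'p' is accepted whatever its extension is.
        System.out.println(CHAR_CLASS.matcher(zip).matches()); // true
        System.out.println(GROUP.matcher(zip).matches());      // false
        System.out.println(GROUP.matcher(php).matches());      // true
    }
}
```

This only shows what the brackets mean for a whole-string match on a bare URL; how the pattern behaves in the original program also depends on how the matcher is applied to the log lines.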
 
Jussi Piitulainen

Oliver said:
erenay said:
I tried the pattern:
Pattern.compile("^(http://)\\S+[(\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|/|(\\d+)]$");
And it doesn't match any URLs.
How should I do it?

I'm not familiar with Java's particular variant of regular
expressions, but maybe the new problem is your addition of the
square brackets.

Certainly. Another problem may be the anchors ^ and $. See the Javadoc
for the Pattern.MULTILINE flag and maybe enable it:

Pattern.compile("...", Pattern.MULTILINE)

Did you try the expression nkalagarla gave you?

<quote>
Try this.

Pattern.compile("^(http://)\\S+((\\.htm)|(\\.html)|(\\.asp)|(\\.php)|(\\.aspx)|(\\d))$");
</quote>

Or "http://\\S+(\\.html|\\.htm|\\.asp|\\.php|\\.aspx|\\d)", possibly
with anchors, possibly in multiline mode.

\S+? might be more appropriate than \S+, especially if \d is replaced
with \d+.
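
A small sketch of the difference between \S+ and \S+? once \d becomes \d+: the overall match is the same, but the greedy form leaves only the last digit for the group. The class name, the helper method and the URL without an extension are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVersusReluctant {
    // Returns the text captured by group 1, or null if the input does not match.
    static String trailingDigits(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical URL that ends with a number.
        String url = "http://www.aksiyon.com.tr/sonsayi210";

        // Greedy \S+ swallows "21" and leaves a single digit for (\d+).
        System.out.println(trailingDigits("http://\\S+(\\d+)", url));  // 0
        // Reluctant \S+? stops as early as possible, so (\d+) gets "210".
        System.out.println(trailingDigits("http://\\S+?(\\d+)", url)); // 210
    }
}
```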

Some of these things depend on how the matcher is used.
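
To make the point about anchors and matcher usage concrete, here is a minimal sketch that runs the anchored pattern with find() over several lines at once, with and without Pattern.MULTILINE. It assumes each line holds nothing but a URL, which may not be true of a real access log; the class name, the helper method and the example.com line are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlLineFilter {
    // The alternation suggested above, anchored so a whole line must be a URL.
    static final String URL =
            "^http://\\S+(\\.html|\\.htm|\\.asp|\\.php|\\.aspx|\\d)$";

    // Collects every match that find() reports in the given text.
    static List<String> findAll(String text, int flags) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(URL, flags).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text =
              "http://www.galatasaray.org/Futbol/GS/anket/anket.asp\n"
            + "http://bimonline.insites.be/common/CookieCheck.asp?siteID=2382&TagId=1&Pad=tr&Lang=tr&Country=tr&b=1\n"
            + "http://www.aksiyon.com.tr/sonsayi210.php\n"
            + "http://example.com/logo.gif"; // hypothetical non-matching line

        // Without MULTILINE, ^ and $ refer to the whole input: no match.
        System.out.println(findAll(text, 0).size());                 // 0
        // With MULTILINE they match at every line boundary.
        System.out.println(findAll(text, Pattern.MULTILINE).size()); // 3
    }
}
```

If the program instead reads the log one line at a time and calls matches() on each line, the anchors and the flag make no difference, since matches() already requires the whole input to match.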
 
