Method for normalizing URL?

Chris

Is there a good Java method someplace for normalizing or standardizing a
URL?

I'm looking for something that lower cases the protocol, strips trailing
slashes, encodes characters correctly, and does all the other things that I
will no doubt forget if I try to write this on my own.

Any decent crawler is going to need a function like this.
 
Ryan Stewart

Chris said:
Is there a good Java method someplace for normalizing or standardizing a
URL?

I'm looking for something that lower cases the protocol, strips trailing
slashes, encodes characters correctly, and does all the other things that I
will no doubt forget if I try to write this on my own.

Any decent crawler is going to need a function like this.
Lower case protocol will not likely make much difference. RFCs 1738 and 2396
(which cover URLs and URIs, respectively) both state that the scheme should
essentially be case insensitive. Certainly it's possible, though, that some
servers out there are still in the stone age concerning this.

As for trailing slashes, I assume you mean you wish to remove the final '/' from
something like:
http://some.server.com/path/

The RFCs specify that the final '/' is optional. However in this case I know
that some Oracle servers will not find the resource if you omit the final '/'.
It might be a configuration thing.

Finally, character encoding:
java.net.URLEncoder.encode(String s, String enc)
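One caveat, shown in the rough sketch below: URLEncoder implements HTML form encoding, so it fits individual query-parameter values rather than a whole URL (it would also escape the ':' and '/' separators). The class name EncodeDemo and the helper encodeQueryValue are made up for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; only URLEncoder.encode(String, String) comes from the JDK.
final class EncodeDemo {

    // Form-encodes one query-parameter value as UTF-8.
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Space becomes '+', while '&' and '/' are percent-escaped.
        System.out.println(encodeQueryValue("a b&c/d")); // a+b%26c%2Fd
    }
}
```

Passing a complete URL through this method would mangle it, so it is probably best applied to each value before the query string is assembled.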
 
Chris

Ryan Stewart said:
Lower case protocol will not likely make much difference. RFCs 1738 and 2396
(which cover URLs and URIs, respectively) both state that the scheme should
essentially be case insensitive. Certainly it's possible, though, that some
servers out there are still in the stone age concerning this.

As for trailing slashes, I assume you mean you wish to remove the final '/' from
something like:
http://some.server.com/path/

The RFCs specify that the final '/' is optional. However in this case I know
that some Oracle servers will not find the resource if you omit the final '/'.
It might be a configuration thing.

Finally, character encoding:
java.net.URLEncoder.encode(String s, String enc)

You misunderstand my purpose. I'd like to convert a given URL into a
canonical form so that I can know whether two similar URLs are in fact the
same. You might use such a function in a crawler, for example, to avoid
fetching the same page more than once.
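To make the goal concrete, here is a minimal sketch of the kind of function I mean, built on java.net.URI. The class name UrlCanonicalizer, the method canonicalize, and the choice of rules are assumptions for illustration only: it lowercases the scheme and host, drops default ports and fragments, resolves "." and ".." segments, and turns an empty path into "/". It deliberately leaves other trailing slashes alone, given the warning above that some servers treat them as significant.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; the normalization rules here are one possible choice, not a standard.
final class UrlCanonicalizer {

    static String canonicalize(String raw) {
        // URI.create throws IllegalArgumentException on malformed input.
        URI uri = URI.create(raw.trim()).normalize(); // resolves "." and ".." segments
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("expected an absolute URL with a host: " + raw);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Drop the port when it is the default for the scheme.
        int port = uri.getPort();
        if ((scheme.equals("http") && port == 80) || (scheme.equals("https") && port == 443)) {
            port = -1;
        }

        // Treat an empty path as the root path.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        StringBuilder sb = new StringBuilder(scheme).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(host);
        if (port != -1) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        // The fragment is omitted on purpose: it never reaches the server.
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        String[] urls = {
            "HTTP://Example.COM:80/a/./b/../c?x=1#frag",
            "http://example.com/a/c?x=1"
        };
        for (String url : urls) {
            // add() returns false when the canonical form was already recorded.
            System.out.println((seen.add(canonicalize(url)) ? "fetch " : "skip  ") + url);
        }
    }
}
```

In the usage above the second URL is skipped, because both inputs reduce to http://example.com/a/c?x=1. A real crawler would likely need more rules (percent-escape case, query parameter order, and so on).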
 
Harish

How about URL.equals()? The Javadoc says:
Two URL objects are equal if they have the same protocol, reference
equivalent hosts, have the same port number on the host, and the same file
and fragment of the file.

I'm not sure whether this takes care of the trailing slashes. Also, hosts are
compared by IP address, so that (the IP address lookup) may turn out to be
expensive.
 
Andrey Kuznetsov

You misunderstand my purpose. I'd like to convert a given URL into a
canonical form so that I can know whether two similar URLs are in fact the
same. You might use such a function in a crawler, for example, to avoid
fetching the same page more than once.

You can find a free spider implementation at http://www.acme.com.
It works very well (like most of Jeff's things ;-)
 
Guest

How about URL.equals()? The Javadoc says: Two URL objects are equal if they
have the same protocol, reference equivalent hosts, have the same port
number on the host, and the same file and fragment of the file.

I'm not sure whether this takes care of the trailing slashes. Also, hosts are
compared by IP address, so that (the IP address lookup) may turn out to be
expensive.

Not to mention if there are multiple virtual hosts with the same IP
address.

La'ie Techie
 
Chris Smith

Lāʻie Techie said:
Not to mention if there are multiple virtual hosts with the same IP
address.

Exactly. URL.equals is broken, and shouldn't be used. Use URI.equals
instead.
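
A quick sketch of what URI.equals does and does not do for you; the class and method names are made up for the example. It compares scheme and host without regard to case and never touches DNS, but it does not resolve ".." segments or ignore trailing slashes, so you would probably still call normalize() first.

```java
import java.net.URI;

// Hypothetical demo class; URI.create, URI.equals and URI.normalize are standard JDK calls.
final class UriEqualsDemo {

    // Plain URI.equals: case-insensitive scheme and host, no DNS lookup.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    // Same comparison after resolving "." and ".." path segments.
    static boolean sameAfterNormalize(String a, String b) {
        return URI.create(a).normalize().equals(URI.create(b).normalize());
    }

    public static void main(String[] args) {
        System.out.println(sameUri("HTTP://Example.com/a", "http://example.com/a"));              // true
        System.out.println(sameUri("http://example.com/a/", "http://example.com/a"));             // false
        System.out.println(sameUri("http://example.com/x/../a", "http://example.com/a"));         // false
        System.out.println(sameAfterNormalize("http://example.com/x/../a", "http://example.com/a")); // true
    }
}
```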

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
