Creating a canonicalized url

Dan Cuddeford · Jan 24, 2008

Hello there guys,

I'm trying to track down an easy way to canonicalize a URL from within
Ruby. I've been looking around for this, but all I can find are some
procedural hacks such as

# canonicalize the url
if ($url -notmatch "^[a-z]+://") { $url = "http://$url" }

which isn't going to take into account everything according to RFC 2396:

* Remove all leading and trailing dots
* Replace consecutive dots with a single dot.
* If the hostname can be parsed as an IP address, it should be
normalized to 4 dot-separated decimal values. The client should handle
any legal IP address encoding, including octal, hex, and fewer than 4
components.
* Lowercase the whole string.

# The sequences "/../" and "/./" in the path should be resolved, by
replacing "/./" with "/", and removing "/../" along with the preceding
path component.
# Runs of consecutive slashes should be replaced with a single slash
character.

So is there a method out there for this?
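
For the IP address bullet in the list above, here is a minimal Ruby sketch. It assumes inet_aton-style rules: a "0x" prefix means hex, a leading "0" means octal, and the last component fills whatever bytes remain. normalize_ip is a hypothetical helper name, not something from the standard library.

```ruby
# Hypothetical helper (not part of Ruby's stdlib): returns the host as
# four dot-separated decimal values, or nil if it does not look like an
# IPv4 address under the assumed inet_aton-style rules.
def normalize_ip(host)
  parts = host.split('.', -1)
  return nil if parts.empty? || parts.size > 4

  nums = parts.map do |part|
    case part
    when /\A0x[0-9a-f]+\z/i then part.hex   # hexadecimal component
    when /\A0[0-7]*\z/      then part.oct   # octal component (or plain 0)
    when /\A[1-9]\d*\z/     then part.to_i  # decimal component
    else return nil                         # not numeric, so not an IP
    end
  end

  last = nums.pop
  # Leading components are one byte each; the last one covers the rest.
  return nil if nums.any? { |n| n > 255 }
  return nil if last >= 256**(4 - nums.size)

  value = nums.each_with_index.sum { |n, i| n << (8 * (3 - i)) } + last
  [24, 16, 8, 0].map { |shift| (value >> shift) & 0xff }.join('.')
end

puts normalize_ip('0x7f.1')       # => 127.0.0.1
puts normalize_ip('3279880203')   # => 195.127.0.11
```

A hostname such as www.ruby-lang.org returns nil here, so a caller can fall back to ordinary hostname handling.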

Rob Biedenharn · Jan 24, 2008

I'd start looking at URI, in particular, URI#parse.

$ fri URI#parse
------------------------------------------------------------- URI::parse
URI::parse(uri)
------------------------------------------------------------------------
Synopsis
URI::parse(uri_str)

Args
+uri_str+: String with URI.

Description
Creates one of the URI's subclasses instance from the string.

Raises
URI::InvalidURIError

Raised if URI given is not a correct one.

Usage
require 'uri'

uri = URI.parse("http://www.ruby-lang.org/")
p uri
# => #<URI::HTTP:0x202281be URL:http://www.ruby-lang.org/>
p uri.scheme
# => "http"
p uri.host
# => "www.ruby-lang.org"

As for the "Lowercase the whole string" part, only the domain is
required to be case-insensitive. It is possible for the underlying
web server to ignore case when finding a path, but the URI is not
necessarily a reference to the same resource if the case is altered.

-Rob

Rob Biedenharn http://agileconsultingllc.com

Jean-François Trân · Jan 24, 2008

2008/1/24, Rob Biedenharn said:
As for the "Lowercase the whole string" part, only the domain is
required to be case-insensitive. It is possible for the underlying
web server to ignore case when finding a path, but the URI is not
necessarily a reference to the same resource if the case is altered.

There's URI#normalize and URI#normalize! to downcase the host
part of the url.

-- Jean-François.
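
A short illustration of what URI#normalize changes, as a sketch using only the stdlib uri library. The variable names are made up for the example. Besides the host, it also downcases the scheme and turns an empty path into "/", while the case of an existing path is left alone.

```ruby
require 'uri'

# Scheme and host are downcased; the empty path becomes "/".
normalized = URI.parse('HTTP://www.Ruby-Lang.ORG').normalize
puts normalized.to_s   # => http://www.ruby-lang.org/

# The path keeps its original case.
with_path = URI.parse('http://www.ruBy-lang.org/ARSE').normalize
puts with_path.host    # => www.ruby-lang.org
puts with_path.path    # => /ARSE
```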

Dan Cuddeford · Jan 24, 2008

Thanks for your help - I'll let you know how I get on

Dan Cuddeford · Jan 24, 2008

So it seems using the two together

require 'uri'

uri = URI.parse("http://www.ruBy-lang.org/ARSE")

can = uri.normalize
p can

p can.host

p can.path

means the path keeps its case but the host is normalized.

I think that's it - however,

try it with ruby-lang..org and

/usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http
does not accept registry part: www.ruBy-lang..org (or bad hostname?)
(URI::InvalidURIError)
from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
from canon.rb:3

So I guess it needs a bit of error checking beforehand.
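
One possible shape for that pre-cleaning, as a sketch only. tidy_host and safe_parse are hypothetical names. The sketch assumes the URL already has a scheme and that the authority part holds just a host (no user info or port). It collapses runs of dots and strips leading and trailing dots before parsing, and returns nil instead of raising when the parser still rejects the string. Whether URI.parse rejects a host like ruby-lang..org may vary between Ruby versions, so the cleaning is done on the string first.

```ruby
require 'uri'

# Hypothetical helper: tidy the dots in the host part of a URL string.
# Assumes "scheme://host/..." with no user info or port in the authority.
def tidy_host(url)
  url.sub(%r{\A([a-z][a-z0-9+.-]*://)([^/?#]*)}i) do
    scheme, host = $1, $2
    # Collapse consecutive dots, then drop leading and trailing dots.
    scheme + host.gsub(/\.{2,}/, '.').gsub(/\A\.+|\.+\z/, '')
  end
end

# Hypothetical helper: parse after tidying, nil if still not a valid URI.
def safe_parse(url)
  URI.parse(tidy_host(url))
rescue URI::InvalidURIError
  nil
end

uri = safe_parse('http://www.ruBy-lang..org/ARSE')
puts uri.normalize.host   # => www.ruby-lang.org
```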

Rob Biedenharn · Jan 24, 2008

require 'uri'

def canonicalize(uri)
  u = uri.kind_of?(URI) ? uri : URI.parse(uri.to_s)
  u.normalize!
  newpath = u.path
  while newpath.gsub!(%r{([^/]+)/\.\./?}) { |match|
      $1 == '..' ? match : ''
    } do end
  newpath = newpath.gsub(%r{/\./}, '/').sub(%r{/\.\z}, '/')
  u.path = newpath
  u.to_s
end

canonicalize('http://www.Ruby-Lang.ORG/ARSE/done/../../rear/./end/.')
=> "http://www.ruby-lang.org/rear/end/"
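
An alternative sketch that walks the path segment by segment instead of looping over regex substitutions. resolve_path is a hypothetical name. Compared with the version above, it also collapses runs of consecutive slashes (the last rule in the original list) and simply drops a ".." that has nothing left to remove, a case where the gsub! loop above looks like it would never terminate (for example a path starting with "/../../").

```ruby
# Hypothetical helper: resolve "." and ".." segments and collapse
# consecutive slashes in an absolute path.
def resolve_path(path)
  out = []
  path.split('/', -1).each do |segment|
    case segment
    when '', '.' then next       # empty segment (extra slash) or "."
    when '..'    then out.pop    # step back; no-op when already at root
    else out << segment
    end
  end

  result = '/' + out.join('/')
  # Keep a trailing slash if the input ended in "/", "/." or "/..".
  result += '/' if out.any? && path =~ %r{/(\.\.?)?\z}
  result
end

puts resolve_path('/ARSE/done/../../rear/./end/.')   # => /rear/end/
puts resolve_path('/a//b///c')                       # => /a/b/c
```

The result could be assigned back with u.path = resolve_path(u.path) in place of the two substitution steps.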

-Rob

Rob Biedenharn http://agileconsultingllc.com

Dan Cuddeford · Jan 24, 2008

Wow - thanks for the answer mate!

Jörg W Mittag · Jan 24, 2008

Dan said:
Wow - thanks for the answer mate!

There's also the Addressable Gem: <http://Addressable.RubyForge.Org/>.

It's intended as a standards compliant replacement for the stdlib's
URI library. Take a look into the test directory of that sucker: over
440 Unit Tests (actually, Object Examples) for a frickin' URI parser!
(See: <http://Addressable.RubyForge.Org/specdoc/>) That guy is nuts!
That code's gotta be as rock-solid as it gets.

Oh, and back to the topic at hand: it has a normalize method built in:

begin
  require 'rubygems'
  gem 'addressable'
rescue LoadError; end
require 'addressable/uri'
uri = Addressable::URI.heuristic_parse('www.Ruby-Lang..ORG/ARSE/done/../../r e a r/./end/.#exit')
uri.normalize!
puts uri.display_uri # => http://www.ruby-lang..org/r e a r/end/#exit

jwm

Dan Cuddeford · Jan 25, 2008

Jörg W Mittag said:
puts uri.display_uri # => http://www.ruby-lang..org/r e a r/end/#exit

jwm

Nice but shouldn't it go to ruby-lang.org?

Jörg W Mittag · Jan 26, 2008

Dan said:
Nice but shouldn't it go to ruby-lang.org?

I'm not sure. I just scanned RfC3986 and RfC1034 and I'm not even sure
that's a valid URI host part to begin with. *If* it's invalid, then
there's not much a URI normalizer can do, right?

However, I could be wrong. Reading RfCs is not exactly my specialty.

jwm
