Creating a canonicalized url

Discussion in 'Ruby' started by Dan Cuddeford, Jan 24, 2008.

  1. Hello there guys,

    I'm trying to track down an easy way to canonicalize a URL from within
    Ruby. I've been looking around for this, but all I can find are some
    procedure hacks such as

    # canonicalize the url
    if ($url -notmatch "^[a-z]+://") { $url = "http://$url" }

    which isn't going to take into account everything according to RFC 2396:

    * Remove all leading and trailing dots
    * Replace consecutive dots with a single dot.
    * If the hostname can be parsed as an IP address, it should be
    normalized to 4 dot-separated decimal values. The client should handle
    any legal IP address encoding, including octal, hex, and fewer than 4
    components.
    * Lowercase the whole string.


    * The sequences "/../" and "/./" in the path should be resolved by
    replacing "/./" with "/" and removing "/../" along with the preceding
    path component.
    * Runs of consecutive slashes should be replaced with a single slash
    character.

    So is there a method out there for this?
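The host rules listed above (strip and collapse dots, lowercase, normalize IP encodings) can be sketched in plain Ruby. This is only an illustration under stated assumptions: `canonical_host` and `parse_ip` are hypothetical helper names, not part of any library, and the IP parsing assumes the classic `inet_aton` conventions (1 to 4 parts, each decimal, octal with a leading `0`, or hex with a leading `0x`, with the last part filling the remaining bytes).

```ruby
# Hypothetical helper: parse a dotted IPv4 form the way inet_aton does.
# Returns a dotted-decimal string, or nil if the host is not an IP address.
def parse_ip(host)
  parts = host.downcase.split('.', -1)
  return nil unless (1..4).cover?(parts.size)
  valid = parts.all? { |p| p =~ /\A(?:0x[0-9a-f]+|0[0-7]*|[1-9][0-9]*)\z/ }
  return nil unless valid

  nums = parts.map { |p| Integer(p) } # Integer() understands 0x.. and leading-0 octal
  *leading, last = nums
  return nil unless leading.all? { |n| n <= 255 }
  return nil unless last < 256**(4 - leading.size)

  # Leading parts are single bytes; the last part fills whatever bytes remain.
  value = leading.each_with_index.sum { |n, i| n << (8 * (3 - i)) } + last
  [24, 16, 8, 0].map { |shift| (value >> shift) & 0xff }.join('.')
end

# Hypothetical helper: apply the host rules from the list above.
def canonical_host(host)
  h = host.downcase
          .gsub(/\.{2,}/, '.') # consecutive dots -> single dot
          .sub(/\A\./, '')     # drop a leading dot
          .sub(/\.\z/, '')     # drop a trailing dot
  parse_ip(h) || h
end

canonical_host('..WWW.Ruby-Lang..ORG.') # => "www.ruby-lang.org"
canonical_host('0x7f.1')                # => "127.0.0.1"
```

Note that a purely numeric host such as `3232235777` is treated as an IP address here, which may or may not be what a given application wants.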
    --
    Posted via http://www.ruby-forum.com/.
    Dan Cuddeford, Jan 24, 2008
    #1

  2. On Jan 24, 2008, at 7:14 AM, Dan Cuddeford wrote:

    > Hello there guys,
    >
    > I'm trying to track down an easy way to canonicalize a URL from within
    > Ruby. I've been looking around for this, but all I can find are some
    > procedure hacks such as
    >
    > # canonicalize the url
    > if ($url -notmatch "^[a-z]+://") { $url = "http://$url" }
    >
    > which isn't going to take into account everything according to RFC
    > 2396
    >
    > * Remove all leading and trailing dots
    > * Replace consecutive dots with a single dot.
    > * If the hostname can be parsed as an IP address, it should be
    > normalized to 4 dot-separated decimal values. The client should handle
    > any legal IP address encoding, including octal, hex, and fewer than 4
    > components.
    > * Lowercase the whole string.
    >
    >
    > * The sequences "/../" and "/./" in the path should be resolved by
    > replacing "/./" with "/" and removing "/../" along with the preceding
    > path component.
    > * Runs of consecutive slashes should be replaced with a single slash
    > character.
    >
    > So is there a method out there for this?


    I'd start by looking at URI, in particular URI.parse.

    $ fri URI#parse
    ------------------------------------------------------------- URI::parse
    URI::parse(uri)
    ------------------------------------------------------------------------
    Synopsis
    URI::parse(uri_str)

    Args
    +uri_str+: String with URI.

    Description
    Creates one of the URI's subclasses instance from the string.

    Raises
    URI::InvalidURIError

    Raised if URI given is not a correct one.

    Usage
    require 'uri'

    uri = URI.parse("http://www.ruby-lang.org/")
    p uri
    # => #<URI::HTTP:0x202281be URL:http://www.ruby-lang.org/>
    p uri.scheme
    # => "http"
    p uri.host
    # => "www.ruby-lang.org"

    As for the "Lowercase the whole string" part, only the domain is
    required to be case-insensitive. It is possible for the underlying
    web server to ignore case when finding a path, but the URI is not
    necessarily a reference to the same resource if the case is altered.

    -Rob

    Rob Biedenharn http://agileconsultingllc.com
    Rob Biedenharn, Jan 24, 2008
    #2

  3. 2008/1/24, Rob Biedenharn:

    > As for the "Lowercase the whole string" part, only the domain is
    > required to be case-insensitive. It is possible for the underlying
    > web server to ignore case when finding a path, but the URI is not
    > necessarily a reference to the same resource if the case is altered.


    There's URI#normalize and URI#normalize! to downcase the host
    part of the url.

    -- Jean-François.
    Jean-François Trân, Jan 24, 2008
    #3
  4. Dan Cuddeford, Jan 24, 2008
    #4
  5. So it seems using the two together


    require 'uri'

    uri = URI.parse("http://www.ruBy-lang.org/ARSE")

    can = uri.normalize
    p can

    p can.host

    p can.path


    means the path keeps its case sensitivity but the host is normalized.

    I think that's it - however,

    try it with ruby-lang..org and

    /usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http
    does not accept registry part: www.ruBy-lang..org (or bad hostname?)
    (URI::InvalidURIError)
    from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
    from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
    from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
    from canon.rb:3

    So I guess it needs a bit of error checking beforehand.
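One way to do that checking up front is to clean the host before parsing and rescue the parse error. The sketch below is a minimal illustration, not a definitive fix: `safe_normalize` is a hypothetical helper name, and it assumes the URL already carries a scheme followed by `://`. Only `URI.parse`, `URI::Generic#normalize` and `URI::InvalidURIError` come from the standard library.

```ruby
require 'uri'

# Hypothetical helper: returns the normalized URL as a string,
# or nil when the input still cannot be parsed.
def safe_normalize(url)
  # Collapse runs of dots in the authority part only
  # (the text between "://" and the next "/", "?" or "#").
  cleaned = url.sub(%r{\A([a-z][a-z0-9+.-]*://)([^/?#]*)}i) do
    scheme = $1
    authority = $2
    scheme + authority.gsub(/\.{2,}/, '.')
  end
  URI.parse(cleaned).normalize.to_s
rescue URI::InvalidURIError
  nil
end

safe_normalize('http://www.ruBy-lang..org/ARSE')
# => "http://www.ruby-lang.org/ARSE"
```

Returning `nil` on failure is just one choice; re-raising with a clearer message would work equally well.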
    Dan Cuddeford, Jan 24, 2008
    #5
  6. On Jan 24, 2008, at 9:23 AM, Dan Cuddeford wrote:

    > So it seems using the two together
    >
    >
    > require 'uri'
    >
    > uri = URI.parse("http://www.ruBy-lang.org/ARSE")
    >
    > can = uri.normalize
    > p can
    >
    > p can.host
    >
    > p can.path
    >
    >
    > means the path keeps its case sensitivity but the host is normalized.
    >
    > I think that's it - however,
    >
    > try it with ruby-lang..org and
    >
    > /usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http
    > does not accept registry part: www.ruBy-lang..org (or bad hostname?)
    > (URI::InvalidURIError)
    > from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
    > from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
    > from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
    > from canon.rb:3
    >
    > So I guess it needs a bit of error checking beforehand.


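    require 'uri'

    def canonicalize(uri)
      u = uri.kind_of?(URI) ? uri : URI.parse(uri.to_s)
      u.normalize!                      # downcases the scheme and host
      newpath = u.path.dup
      # Repeatedly drop "segment/../" pairs. Stop once a pass changes
      # nothing, so an unresolvable leading "../.." cannot loop forever.
      loop do
        before = newpath.dup
        newpath.gsub!(%r{([^/]+)/\.\./?}) { |match| $1 == '..' ? match : '' }
        break if newpath == before
      end
      # Replace "/./" with "/" and a trailing "/." with "/".
      newpath = newpath.gsub(%r{/\./}, '/').sub(%r{/\.\z}, '/')
      u.path = newpath
      u.to_s
    end

    canonicalize('http://www.Ruby-Lang.ORG/ARSE/done/../../rear/./end/.')
    # => "http://www.ruby-lang.org/rear/end/"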
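An alternative to regex substitution is to split the path on "/" and keep a stack of segments, which also takes care of runs of consecutive slashes. This is a sketch under assumptions: `resolve_path` is a hypothetical helper name, the path is assumed to be absolute, and a ".." that would climb above the root is simply discarded.

```ruby
# Hypothetical helper: resolve "." and ".." segments and collapse
# repeated slashes in an absolute path.
def resolve_path(path)
  # Remember whether the result should end with a slash.
  trailing = path.end_with?('/', '/.', '/..')
  stack = []
  path.split('/').each do |seg|
    case seg
    when '', '.' then next      # empty segment (from "//") or "."
    when '..'    then stack.pop # drop the preceding component, if any
    else stack << seg
    end
  end
  resolved = '/' + stack.join('/')
  resolved += '/' if trailing && resolved != '/'
  resolved
end

resolve_path('/ARSE/done/../../rear/./end/.') # => "/rear/end/"
resolve_path('/a//b///c')                     # => "/a/b/c"
```

The result could then be assigned back with `u.path = resolve_path(u.path)` on a parsed URI object.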

    -Rob

    Rob Biedenharn http://agileconsultingllc.com
    Rob Biedenharn, Jan 24, 2008
    #6
  7. Dan Cuddeford, Jan 24, 2008
    #7
  8. Dan Cuddeford wrote:
    > Wow - thanks for the answer mate!


    There's also the Addressable Gem: <http://Addressable.RubyForge.Org/>.

    It's intended as a standards compliant replacement for the stdlib's
    URI library. Take a look into the test directory of that sucker: over
    440 Unit Tests (actually, Object Examples) for a frickin' URI parser!
    (See: <http://Addressable.RubyForge.Org/specdoc/>) That guy is nuts!
    That code's gotta be as rock-solid as it gets.

    Oh, and back to the topic at hand: it has a normalize method built in:

    begin
      require 'rubygems'
      gem 'addressable'
    rescue LoadError; end
    require 'addressable/uri'
    uri = Addressable::URI.heuristic_parse('www.Ruby-Lang..ORG/ARSE/done/../../r e a r/./end/.#exit')
    uri.normalize!
    puts uri.display_uri # => http://www.ruby-lang..org/r e a r/end/#exit

    jwm
    Jörg W Mittag, Jan 24, 2008
    #8
  9. Dan Cuddeford, Jan 25, 2008
    #9
  10. Dan Cuddeford wrote:
    > Jörg W Mittag wrote:
    >> puts uri.display_uri # =>
    >> http://www.ruby-lang..org/r e a r/end/#exit

    > Nice but shouldn't it go to ruby-lang.org?


    I'm not sure. I just scanned RFC 3986 and RFC 1034 and I'm not even sure
    that's a valid URI host part to begin with. *If* it's invalid, then
    there's not much a URI normalizer can do, right?
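As far as I can tell, the question has two layers: the generic URI grammar in RFC 3986 lets dots appear freely in a registered name, while DNS (RFC 1034) requires every label to be 1 to 63 octets long, so an empty label between two dots cannot name a real host. A rough check for the DNS side could look like the sketch below. `dns_hostname?` and `LABEL` are hypothetical names, and the pattern assumes the common "letters, digits, hyphen" host rules rather than anything the URI library enforces.

```ruby
# Assumed label syntax: 1-63 characters, letters, digits or hyphens,
# with no hyphen at either end.
LABEL = /\A[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/i

# Hypothetical check: does the host look like a usable DNS name?
def dns_hostname?(host)
  name = host.chomp('.') # a single trailing dot (the root) is allowed
  return false if name.empty? || name.length > 253
  name.split('.', -1).all? { |label| label.match?(LABEL) }
end

dns_hostname?('www.ruby-lang.org')  # => true
dns_hostname?('www.ruby-lang..org') # => false (empty label)
```

A normalizer could use a check like this to decide whether to reject the host or to collapse the dots first.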

    However, I could be wrong. Reading RFCs is not exactly my specialty.

    jwm
    Jörg W Mittag, Jan 26, 2008
    #10
