How to parse a unicode url?

Dan The man

I would really like to be able to do the following. Is this even
possible?

Thanks,
nerdytenor

uri = URI.parse('http://www.hören.com') # not a real url (that I know of)
URI::InvalidURIError: bad URI(is not URI?): http://www.hören.com
from /usr/lib/ruby/1.8/uri/common.rb:432:in `split'
from /usr/lib/ruby/1.8/uri/common.rb:481:in `parse'
from (irb):26
 
7stud --

You can do this:

require "uri"

url = "http://www.hören.co"
enc_url = URI.encode(url)
puts enc_url


to get this:

http://www.h%C3%B6ren.co

which, according to Wikipedia here:

http://en.wikipedia.org/wiki/Percent-encoding

is a legal URI. But when I do this:


require "uri"

url = "http://www.hören.co"
enc_url = URI.encode(url)
puts enc_url

uri = URI.parse(enc_url)

I get this:

http://www.h%C3%B6ren.co
/usr/lib/ruby/1.8/uri/generic.rb:194:in `initialize': the scheme http
does not accept registry part: www.h%C3%B6ren.co (or bad hostname?)
(URI::InvalidURIError)
from /usr/lib/ruby/1.8/uri/http.rb:46:in `initialize'
from /usr/lib/ruby/1.8/uri/common.rb:484:in `new'
from /usr/lib/ruby/1.8/uri/common.rb:484:in `parse'
from r3test.rb:7


which as far as I can tell means that URI.parse() is broken.
 
Robert Klemme

There is no such thing as a Unicode URL. The RFCs for URIs and URLs
specify the character set as 7-bit ASCII, AFAIK.

The legal form of that URL is this: http://www.xn--hren-5qa.com/

See IDNA for details, for example:
http://de.wikipedia.org/wiki/IDNA

Quick searching revealed this - maybe it can help:
http://rubyforge.org/pipermail/idn-discuss/2005-September/000000.html

7stud said:
which as far as I can tell means that URI.parse() is broken.

I don't think so. There are invalid characters in the domain name (as
the exception indicates).
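
To illustrate the point, here is a minimal sketch that parses the ASCII form of the host quoted above and then tries the raw Unicode form. It assumes only Ruby's standard uri library; the variable names are made up for the example, and the exact wording of the error message differs between Ruby versions, so it is printed rather than relied on.

```ruby
require "uri"

# The ASCII (IDNA) form of the host name is accepted by URI.parse.
ascii_uri = URI.parse("http://www.xn--hren-5qa.com/")
puts ascii_uri.host

# The raw Unicode form raises URI::InvalidURIError.
begin
  URI.parse("http://www.hören.com")
  unicode_rejected = false
rescue URI::InvalidURIError => e
  unicode_rejected = true
  puts "rejected: #{e.message}"
end
```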

Kind regards

robert
 
Ollivier Robert

Can you identify which character is invalid in:

http://www.hören.co

According to wikipedia, all those characters are valid for a uri.

They are, but host and domain names do not accept Unicode characters at all and are limited to 7-bit ASCII. Search for IDN for more information.
 
Dan The man

Ollivier said:
They are, but host and domain names do not accept Unicode characters at all
and are limited to 7-bit ASCII. Search for IDN for more information.

I thought this might be the case. However, after trying a few random
domains, typing the following into Firefox gets me a real live page:

http://www.hören.at/

Hmmm...
 
Eric Hodel

Can you identify which character is invalid in:

http://www.hören.co

According to wikipedia, all those characters are valid for a uri.

It says that which characters are valid for each piece of a URI
depends on the URI scheme. The characters valid for the hostname
part of the http URI scheme are governed by DNS, so you
need to use an IDN.

I believe there is a Ruby wrapper for libidn.
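
For anyone who wants to see what the conversion involves without installing a libidn wrapper, below is a rough pure-Ruby sketch of the Punycode encoding from RFC 3492, applied label by label to the host before handing the URL to URI.parse. The method names (punycode_encode, idn_to_ascii and the helpers) are invented for this example and are not part of any library. It also skips the normalization and validation steps that real IDNA performs, so treat it as an illustration and not as a replacement for libidn.

```ruby
require "uri"

# Punycode parameters from RFC 3492, section 5.
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700
INITIAL_BIAS = 72
INITIAL_N = 128

# Map a digit value 0..35 to 'a'..'z', '0'..'9'.
def punycode_digit(d)
  (d < 26 ? d + 97 : d - 26 + 48).chr
end

# Bias adaptation function from RFC 3492, section 6.1.
def punycode_adapt(delta, num_points, first_time)
  delta = first_time ? delta / DAMP : delta / 2
  delta += delta / num_points
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end
  k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
end

# Encode one UTF-8 label (no dots) as Punycode, without the "xn--" prefix.
def punycode_encode(label)
  cps = label.unpack("U*")
  basic = cps.select { |c| c < 0x80 }
  output = basic.pack("U*")
  handled = basic.length
  output << "-" if handled > 0
  n = INITIAL_N
  delta = 0
  bias = INITIAL_BIAS
  while handled < cps.length
    m = cps.select { |c| c >= n }.min
    delta += (m - n) * (handled + 1)
    n = m
    cps.each do |c|
      delta += 1 if c < n
      next unless c == n
      q = delta
      k = BASE
      loop do
        t = if k <= bias then TMIN
            elsif k >= bias + TMAX then TMAX
            else k - bias
            end
        break if q < t
        output << punycode_digit(t + (q - t) % (BASE - t))
        q = (q - t) / (BASE - t)
        k += BASE
      end
      output << punycode_digit(q)
      bias = punycode_adapt(delta, handled + 1, handled == basic.length)
      delta = 0
      handled += 1
    end
    delta += 1
    n += 1
  end
  output
end

# Convert each non-ASCII label of a host name to its "xn--" form.
def idn_to_ascii(host)
  host.split(".").map { |label|
    label.ascii_only? ? label : "xn--" + punycode_encode(label)
  }.join(".")
end

ascii_host = idn_to_ascii("www.hören.com")
puts ascii_host # → www.xn--hren-5qa.com

uri = URI.parse("http://#{ascii_host}/")
puts uri.host
```

Only the host is converted here; the path and query of a URL would still need ordinary percent-encoding if they contain non-ASCII characters.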
 
