(extra) parsing (and verifying of) URL's via java.net.URI ...

Discussion in 'Java' started by qwertmonkey@syberianoutpost.ru, Oct 21, 2012.

  1. Guest

    ~
    as stated in the API
    ~
    http://docs.oracle.com/javase/7/docs/api/java/net/URI.html
    ~
    this would be an extra feature, but it would be nice if as part of the API
    you could tell apart what is a TLD, as well as suffixes that some regional
    registrar authorities mark as not publicly available. For example, there
    are lots of extra suffixes the Japanese apparently guard (basically all cities
    and prefectures):
    ~
    ".ac.jp", ".abashiri.hokkaido.jp", ...
    ~
    Norwegians do the same thing: ".fylkesbibl.no", ".gs.[county].no", ...
    ~
    I think in the States they have similar restrictions regarding TLDs and
    suffixes such as ".edu" and ".mil", and the French even guard their dentists to
    the point of pairing them with surgeons: ".chirurgiens-dentistes.fr" ;-)
    ~
    Those parsing issues are important in use cases in which you want to know
    what is in a host name:
    ~
    $ host www.dos.state.fl.us
    www.dos.state.fl.us has address 207.156.20.19

    $ host dos.state.fl.us
    dos.state.fl.us has address 207.156.20.19
    dos.state.fl.us mail is handled by 10 mail.dos.state.fl.us.
    dos.state.fl.us mail is handled by 20 xm.dos.state.fl.us.

    $ host state.fl.us
    state.fl.us mail is handled by 10 dohsmsi07.doh.state.fl.us.
    state.fl.us mail is handled by 10 dohsmsi06.doh.state.fl.us.

    $ host doh.state.fl.us
    doh.state.fl.us has address 199.250.17.86
    doh.state.fl.us mail is handled by 1 mx0003.doh.state.fl.us.
    doh.state.fl.us mail is handled by 1 mx5201.doh.state.fl.us.
    doh.state.fl.us mail is handled by 1 mx0001.doh.state.fl.us.
    doh.state.fl.us mail is handled by 1 mx0002.doh.state.fl.us.

    $ host fl.us
    $
    ~
    subdepartments inside of a University department may have their own
    website (or not)
    ~
    $ host cornell.edu
    cornell.edu has address 128.253.173.242
    cornell.edu has address 128.253.173.243
    cornell.edu has address 128.253.173.244
    cornell.edu has address 128.253.173.245
    cornell.edu has address 128.253.173.246
    cornell.edu has address 128.253.173.241
    cornell.edu mail is handled by 10 router2.mail.cornell.edu.
    cornell.edu mail is handled by 10 router3.mail.cornell.edu.
    cornell.edu mail is handled by 10 router4.mail.cornell.edu.
    cornell.edu mail is handled by 10 router9.mail.cornell.edu.
    cornell.edu mail is handled by 10 router10.mail.cornell.edu.
    cornell.edu mail is handled by 10 router1.mail.cornell.edu.

    knoppix@Microknoppix:~$ host www.cs.cornell.edu
    www.cs.cornell.edu is an alias for www1.cs.cornell.edu.
    www1.cs.cornell.edu has address 128.84.154.137

    knoppix@Microknoppix:~$ host www.cornell.edu
    www.cornell.edu is an alias for wwwcornelledu-ssl.cit.cornell.edu.
    wwwcornelledu-ssl.cit.cornell.edu has address 132.236.204.10
    ~
    There is lots of info scattered all over the Internet about web naming issues
    ~
    http://www.cs.cornell.edu/people/egs/papers/dnssurvey.pdf
    ~
    but AFAIK there is no central registry of such data. Do you know better?
    ~
    thanks
    lbrtchx
    comp.lang.java.programmer: (extra) parsing (and verifying of) URL's via java.net.URI ...
    , Oct 21, 2012
    #1

  2. markspace Guest

    On 10/21/2012 7:45 AM, wrote:
    > ~
    > as stated in the API
    > ~
    > http://docs.oracle.com/javase/7/docs/api/java/net/URI.html



    Relating this to your subject title, URIs are not URLs. You can't
    really "verify" a URI; they're just strings.


    > ~
    > this would be an extra feature, but it would be nice if as part of the API
    > you could tell apart what is a TLD, as well as suffixes that some regional
    > registrar authorities mark as not publicly available.



    Why? Just send info to the host name. If it gets there it's correct.
    Otherwise send an error message to the user.

    Plus URLs may be relative. There might be no TLD in the string at all.
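
The point about relative references can be checked directly against java.net.URI. The following is only a minimal sketch: the class name UriHostSketch, the helper hostOf and the relative path used in main are made up for illustration. It relies on URI.getHost() returning null when the parsed string has no host component.

```java
import java.net.URI;

// Hypothetical helper class, not part of the JDK or of this thread.
final class UriHostSketch {

    // Returns the host component of the given URI text, or null when the
    // URI has none (for example, a relative reference).
    static String hostOf(String uriText) {
        return URI.create(uriText).getHost();
    }

    public static void main(String[] args) {
        // Absolute URI: the host component is present.
        System.out.println(hostOf("http://www.dos.state.fl.us/index.html"));
        // Relative reference (made-up path): no host, so no TLD to inspect.
        System.out.println(hostOf("docs/api/index.html"));
    }
}
```

Any code that goes on to inspect the TLD therefore needs a null check first.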


    > ~
    > Those parsing issues are important in use cases in which you want to know
    > what is in a host name:



    What makes you think this information is important? Why would you care?
    Just use the hostname as given.
    markspace, Oct 21, 2012
    #2

  3. Arne Vajhoej Guest

    On 10/21/2012 10:45 AM, wrote:
    > ~
    > as stated in the API
    > ~
    > http://docs.oracle.com/javase/7/docs/api/java/net/URI.html
    > ~
    > this would be an extra feature, but it would be nice if as part of the API
    > you could tell apart what is a TLD,


    You can get the hostname out and then split on period and take the last.
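
A rough sketch of that suggestion, assuming the input parses as a java.net.URI. The class name LastLabelSketch and the method lastLabel are hypothetical. Note that for an IP-address host the last label is just a number, so this only makes sense for DNS names.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper class illustrating "split on period and take the last".
final class LastLabelSketch {

    // Returns the last dot-separated label of the URI's host in lower case,
    // or null when the URI has no host component.
    static String lastLabel(String uriText) {
        String host = URI.create(uriText).getHost();
        if (host == null || host.isEmpty()) {
            return null;
        }
        // Drop the trailing dot of a fully qualified name, if there is one.
        if (host.endsWith(".")) {
            host = host.substring(0, host.length() - 1);
        }
        // Everything after the last dot; the whole host if it has no dot.
        String label = host.substring(host.lastIndexOf('.') + 1);
        return label.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lastLabel("http://www.cs.cornell.edu/people/egs/papers/dnssurvey.pdf"));
        System.out.println(lastLabel("http://www.dos.state.fl.us/"));
    }
}
```

This gives the TLD only. It says nothing about multi-label suffixes such as ".ac.jp".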

    > as well as suffixes that some regional
    > registrar authorities mark as not publicly available. For example, there
    > are lots of extra suffixes the Japanese apparently guard (basically all cities
    > and prefectures):
    > ~
    > ".ac.jp", ".abashiri.hokkaido.jp", ...
    > ~
    > Norwegians do the same thing: ".fylkesbibl.no", ".gs.[county].no", ...
    > ~
    > I think in the states they have similar restrictions regarding tlds and
    > suffixes such as ".edu" and ".mil" and the French even guard their dentists to
    > the point of pairing them with surgeons: ".chirurgiens-dentistes.fr" ;-)


    I don't think the Java vendors would want to maintain logic that reflects
    the different policies of 200 TLDs.

    Maintaining something as simple as timezones is hard enough.

    Arne
    Arne Vajhoej, Oct 21, 2012
    #3
  4. Roedy Green Guest

    See http://mindprod.com/jgloss/tld.html
    and follow links. That will at least give you all the country domains
    and the major global ones.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    There are four possible ways to poke a card into a slot.
    Nearly always, only one way works. To me that betrays a
    Fascist mentality, demanding customers conform to some
    arbitrary rule, and hassling them to discover the magic
    orientation. The polite way to do it is to design the reader
    slot so that all four ways work, or so that all the customer
    has to do is put the card in the vicinity of the reader.
    Roedy Green, Oct 23, 2012
    #4
  5. Guest

    On Sunday, October 21, 2012 3:45:43 PM UTC+1, wrote:
    > this would be an extra feature, but it would be nice if as part of the API
    > you could tell apart what is a TLD, as well as suffixes that some regional
    > registrar authorities mark as not publicly available.


    I think markspace's advice is best as there is no foolproof way to determine if a string is a host name or a domain name (even if you look at DNS "A" / "AAAA" records).

    But maybe http://publicsuffix.org/list/ (was https://wiki.mozilla.org/TLD_List) will help you do whatever it is you want to do.
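
To give an idea of how such a list could be used, here is a simplified sketch. It assumes the suffixes have already been loaded into a set. The small hard-coded sample is hand-picked from the suffixes mentioned in this thread and is not the real list. The class and method names are hypothetical, and the wildcard and exception rules the real list is understood to contain are ignored here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of longest-suffix matching against a suffix set.
final class PublicSuffixSketch {

    // Hand-picked sample for illustration only; NOT the real public suffix list.
    static final Set<String> SAMPLE_SUFFIXES = new HashSet<>(Arrays.asList(
            "jp", "ac.jp", "abashiri.hokkaido.jp",
            "no", "fylkesbibl.no",
            "fr", "chirurgiens-dentistes.fr",
            "edu"));

    // Returns the longest suffix of host found in the sample set, or null.
    static String publicSuffix(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        // Try the whole name first, then drop the leftmost label each round,
        // so the first hit is the longest matching suffix.
        while (true) {
            if (SAMPLE_SUFFIXES.contains(h)) {
                return h;
            }
            int dot = h.indexOf('.');
            if (dot < 0) {
                return null;
            }
            h = h.substring(dot + 1);
        }
    }

    // Returns the suffix plus one label to its left, or null when the host
    // is itself a suffix or matches nothing in the sample set.
    static String registrableDomain(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        String suffix = publicSuffix(h);
        if (suffix == null || suffix.length() == h.length()) {
            return null;
        }
        String left = h.substring(0, h.length() - suffix.length() - 1);
        return left.substring(left.lastIndexOf('.') + 1) + "." + suffix;
    }

    public static void main(String[] args) {
        System.out.println(registrableDomain("www.cs.cornell.edu")); // cornell.edu
        System.out.println(registrableDomain("www.example.ac.jp"));  // example.ac.jp
        System.out.println(registrableDomain("ac.jp"));              // null
    }
}
```

With a set like this, "ac.jp" is recognised as a suffix rather than as somebody's domain, which is the distinction the original question was after.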
    , Oct 23, 2012
    #5
