java.net.URI.equals()

Peter Davis

I'm trying to figure something out -- java.net.URI.equals() makes no
sense the way it is specified:

"When testing the user-information, path, query, fragment, authority,
or scheme-specific parts of two URIs for equality, the raw forms rather
than the encoded forms of these components are compared and the
hexadecimal digits of escaped octets are compared without regard to
case."

First of all, is that a typo? I think it's trying to say "the raw
forms rather than the /decoded/ forms...". Aside from that, the
problem is this: it would make sense if two URIs with encoded and
unencoded versions of the same characters were equal, but they're
not. I wrote a little test class:

import java.net.*;

class Test {
    public static void main(String[] args) throws Throwable {
        URI u1 = new URI("foo%7Ebar");
        URI u2 = new URI("foo%7ebar");
        URI u3 = new URI("foo~bar");

        System.out.println(u1 + " == " + u2 + " => " + u1.equals(u2));
        System.out.println(u1 + " == " + u3 + " => " + u1.equals(u3));
        System.out.println(u2 + " == " + u3 + " => " + u2.equals(u3));
    }
}

Outputs:

foo%7Ebar == foo%7ebar => true
foo%7Ebar == foo~bar => false
foo%7ebar == foo~bar => false

Why in the world would it compare the raw rather than decoded forms of
the URI? Anybody have any clues?
 
Daniel Bonniot

Peter said:
I'm trying to figure something out -- java.net.URI.equals() makes no
sense the way it is specified:

"When testing the user-information, path, query, fragment,
authority, or scheme-specific parts of two URIs for equality, the raw
forms rather than the encoded forms of these components are compared and
the hexadecimal digits of escaped octets are compared without regard
to case."

First of all, is that a typo? I think it's trying to say "the raw forms
rather than the /decoded/ forms...".

Unless by raw they mean "decoded". But the implementation does not seem to
match that.
Aside from that, the problem is this: it would make sense if two URIs
with encoded and unencoded versions of the same characters were equal,
but they're not. I wrote a little test class:

import java.net.*;

class Test {
    public static void main(String[] args) throws Throwable {
        URI u1 = new URI("foo%7Ebar");
        URI u2 = new URI("foo%7ebar");
        URI u3 = new URI("foo~bar");

        System.out.println(u1 + " == " + u2 + " => " + u1.equals(u2));
        System.out.println(u1 + " == " + u3 + " => " + u1.equals(u3));
        System.out.println(u2 + " == " + u3 + " => " + u2.equals(u3));
    }
}

Outputs:

foo%7Ebar == foo%7ebar => true
foo%7Ebar == foo~bar => false
foo%7ebar == foo~bar => false

Why in the world would it compare the raw rather than decoded forms of
the URI? Anybody have any clues?

Not a direct answer, but what should happen when non-ASCII characters occur
(directly or in encoded form)? For instance, is %E1 equal to á (as it is
in some latin-? encodings)?

Daniel
 
Peter Davis

Unless by raw they mean "decoded". But the implementation does not seem
to match that.

The rest of the class is consistent with "raw" meaning the
originally-parsed URI string, which will likely contain %XX escapes.

Perhaps that's the whole problem. Some spec author at Sun probably
meant raw == decoded, and some other implementor probably interpreted it
as raw == encoded-but-not-normalized, as it is throughout the rest of
the class. Otherwise it doesn't make any sense that two URIs can be
semantically equal via character escapes yet not equal according to equals().
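For what it's worth, the accessors behave that way too, at least on the
JDK I tried. A quick check of my own (nothing official, just contrasting
getRawPath() with getPath()):

import java.net.*;

class RawVsDecoded {
    public static void main(String[] args) throws Throwable {
        URI escaped = new URI("foo%7Ebar");
        URI literal = new URI("foo~bar");

        // The "raw" accessors return the component as parsed, escapes intact.
        System.out.println(escaped.getRawPath()); // foo%7Ebar
        System.out.println(literal.getRawPath()); // foo~bar

        // The plain accessors decode %XX escapes, and on these the two agree.
        System.out.println(escaped.getPath());    // foo~bar
        System.out.println(literal.getPath());    // foo~bar

        // equals() works from the raw forms, so the URIs still compare unequal.
        System.out.println(escaped.equals(literal)); // false
    }
}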
Aside from that, the problem is this: it would make sense if two URIs
with encoded and unencoded versions of the same characters were equal,
but they're not. I wrote a little test class:

[...snip...]

Outputs:

foo%7Ebar == foo%7ebar => true
foo%7Ebar == foo~bar => false
foo%7ebar == foo~bar => false

Why in the world would it compare the raw rather than decoded forms of
the URI? Anybody have any clues?

Not a direct answer, but what should happen when non-ASCII characters
occur (directly or in encoded form)? For instance, is %E1 equal to á
(as it is in some latin-? encodings)?

It's specified that %XX escapes are decoded as if they were UTF-8 bytes,
so %E1 wouldn't be equal to á, but %E2%82%AC, for example, is equal to
the Euro character.
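
That matches what the decoded accessors return when I try it (my own
check, not something quoted from the docs):

import java.net.*;

class Utf8Check {
    public static void main(String[] args) throws Throwable {
        // %E2%82%AC is the UTF-8 encoding of U+20AC, the Euro sign,
        // so the decoded path comes back as "foo\u20ACbar".
        URI euro = new URI("foo%E2%82%ACbar");
        System.out.println("foo\u20ACbar".equals(euro.getPath()));   // true

        // A lone %E1 isn't a valid UTF-8 sequence, so it does not decode
        // to the Latin-1 á; the decoded path is not "foo\u00E1bar".
        URI latin = new URI("foo%E1bar");
        System.out.println("foo\u00E1bar".equals(latin.getPath()));  // false
    }
}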

So anyway, I understand that the spec is the way it is, and it's
perfectly unambiguous and reproducible in this manner, but it just
seems useless to me.

For example, if you have a URI like "foo/../bar" and you invoke
normalize() on it, then that URI will be equal, according to equals(),
to "bar". So there is a way to normalize paths, but there is no way to
normalize escape sequences so that equals() behaves properly. This is a
problem because %7E (~) is a notoriously confused character, with some
applications escaping it and some not.
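
The best workaround I've come up with is to sidestep equals() and compare
the decoded components myself. A rough sketch (decodedEquals is just a
name I made up, and it deliberately ignores subtleties such as an escaped
"/" inside a path segment, where decoding loses information, and the
case-insensitive scheme/host comparison that equals() does):

import java.net.*;

class DecodedCompare {
    // Treat two URIs as equal if their decoded components match. This is
    // looser than strict RFC equivalence: e.g. %2F in a path decodes to "/"
    // and the distinction is lost.
    static boolean decodedEquals(URI a, URI b) {
        return eq(a.getScheme(), b.getScheme())
            && eq(a.getSchemeSpecificPart(), b.getSchemeSpecificPart())
            && eq(a.getFragment(), b.getFragment());
    }

    private static boolean eq(Object x, Object y) {
        return x == null ? y == null : x.equals(y);
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(decodedEquals(new URI("foo%7Ebar"), new URI("foo~bar")));   // true
        System.out.println(decodedEquals(new URI("foo%7Ebar"), new URI("foo%7ebar"))); // true
        System.out.println(decodedEquals(new URI("foo/bar"),   new URI("foo/baz")));   // false
    }
}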
 
