URL encoding api in Java 1.4.2

Saju Pillai

The java.net.URLEncoder actually performs the older style of
form-encoding (space -> +). Is there a library api that performs the
newer style percentage encoding (space -> %20) ?

srp
 
John B. Matthews

Saju Pillai said:
The java.net.URLEncoder actually performs the older style of
form-encoding (space -> +). Is there a library api that performs the
newer style percentage encoding (space -> %20) ?

The API for java.net.URLEncoder mentions "converting a String to the
application/x-www-form-urlencoded MIME format."

http://java.sun.com/javase/6/docs/api/java/net/URLEncoder.html

The cited specification includes the described encoding:

http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4

What standard do you propose? Would it be hard to implement?
 
Saju Pillai

John said:
The API for java.net.URLEncoder mentions "converting a String to the
application/x-www-form-urlencoded MIME format."

http://java.sun.com/javase/6/docs/api/java/net/URLEncoder.html

The cited specification includes the described encoding:

http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4

What standard do you propose? Would it be hard to implement?

I am looking for a 1.4.2 library api that can do percent-encoding as
specified in RFC 3986 (sec 2.1).

pct-encoded = "%" HEXDIG HEXDIG

URLEncoder does pct-encoded in most cases but replaces SPACE (ascii
0x20) with '+' instead of %20, which doesn't work for me. URLEncoder
works true to its documentation, but the name FormEncoder would be a
better fit; a URI encoder should likely perform full percent-encoding by
default.
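
A minimal sketch of what such an encoder could look like if written by hand, assuming UTF-8 as the character encoding and the RFC 3986 unreserved set (ALPHA / DIGIT / "-" / "." / "_" / "~") as the only characters left unescaped. The class name PercentEncoderSketch and its encode method are hypothetical, not a library API; StringBuffer is used so that it should also compile on older JDKs.

```java
import java.io.UnsupportedEncodingException;

class PercentEncoderSketch {
    // RFC 3986 section 2.3: characters that never need escaping.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Percent-encodes every UTF-8 byte that is not an unreserved character.
    static String encode(String s) {
        byte[] bytes;
        try {
            bytes = s.getBytes("UTF-8"); // assumption: UTF-8 is the wanted charset
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-8 not supported: " + e);
        }
        StringBuffer out = new StringBuffer(bytes.length * 3);
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            if (b < 0x80 && UNRESERVED.indexOf((char) b) >= 0) {
                out.append((char) b);
            } else {
                // pct-encoded = "%" HEXDIG HEXDIG
                out.append('%').append(HEX[b >> 4]).append(HEX[b & 0x0F]);
            }
        }
        return out.toString();
    }
}
```

With this sketch, PercentEncoderSketch.encode("a b+c") gives a%20b%2Bc. It escapes reserved characters such as '/' and '&' too, so it suits a single component value rather than a whole URI.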

srp
 
Mark Space

Saju said:
The java.net.URLEncoder actually performs the older style of
form-encoding (space -> +). Is there a library api that performs the
newer style percentage encoding (space -> %20) ?


Just what is the whole history behind %20? I thought it was a hack some
vendor came up with (Microsoft? Netscape?) before there were standards.
That would make the + the newer style and %20 the older, deprecated
one, but I don't know for sure.
 
Saju Pillai

Mark said:
Just what is the whole history behind %20? I thought it was a hack some
vendor came up with (Microsoft? Netscape?) before there were standards.
That would make the + the newer style and %20 the older, deprecated
one, but I don't know for sure.

%20 comes from the percent-encoding scheme which is described in RFC
3986 (URI Generic Syntax). AFAIK, the '+' is the older encoding
technique which was obsoleted by predecessors of RFC 3986. I think '+'
is the vendor hack. I would be happy to be corrected.

At least in the Python library there are two variants of 'URI encoding':
urllib.quote() and urllib.quote_plus(). The first does space->%20; the
second differs only in doing space->+.

I am looking for a java equivalent that does the space->%20 (and I think
the correct) form.

srp
 
Mark Space

Saju said:
I am looking for a 1.4.2 library api that can do percent-encoding as
specified in RFC 3986 (sec 2.1).

pct-encoded = "%" HEXDIG HEXDIG

URLEncoder does pct-encoded in most cases but replaces SPACE (ascii
0x20) with '+' instead of %20, which doesn't work for me. URLEncoder
works true to its documentation, but the name FormEncoder would be a
better fit; a URI encoder should likely perform full percent-encoding by
default.

Section 3.3 of that RFC says that paths can consist of, among other
things, pchars which are:

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

A '+' is a sub-delims. Ditto for section 3.4 and 3.5 which describe
queries and fragments. Both permit a "pchar" which includes the '+'.

I think RFC 3986 says how to interpret URIs and how they *may* be
encoded, not how they *must* be encoded. I.e., there's some wiggle room
as long as the encoding is valid. Thus as long as java.net.URLEncoder
didn't add a space somewhere, it's valid under this RFC.

What tool are you using that doesn't interpret the + correctly? That's
my question.
 
John B. Matthews

Mark Space said:
Just what is the whole history behind %20? I thought it was a hack
some vendor came up with (Microsoft? Netscape?) before there were
standards. That would make the + the newer style and %20 the older,
deprecated one, but I don't know for sure.

I am lost in a maze of twisty little standards, all different.
 
Mark Space

Saju said:
At least in the Python library there are two variants of 'URI encoding':
urllib.quote() and urllib.quote_plus(). The first does space->%20; the
second differs only in doing space->+.

I'm looking at the spec which John linked to, which uses the term "must"
when talking about + encoding. Python libs don't define any standards.
I'm not sure quoting python as if it's authoritative is a great idea.



"This is the default content type. Forms submitted with this content
type must be encoded as follows:
^^^^

1. Control names and values are escaped. Space characters are
replaced by `+', and then reserved characters are escaped as described
in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by
`%HH', a percent sign and two hexadecimal digits representing the ASCII
code of the character. Line breaks are represented as "CR LF" pairs
(i.e., `%0D%0A').
2. The control names/values are listed in the order they appear in
the document. The name is separated from the value by `=' and name/value
pairs are separated from each other by `&'.

"

My emphasis.
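
The two quoted steps can be sketched in Java roughly as follows, assuming UTF-8 and using java.net.URLEncoder for the escaping. The class name FormEncodeSketch and the String[][] input shape are made up for illustration, and the sketch does not normalise line breaks to CR LF.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormEncodeSketch {
    // pairs[i][0] is a control name, pairs[i][1] its value, in document order.
    static String encode(String[][] pairs) {
        StringBuffer out = new StringBuffer();
        try {
            for (int i = 0; i < pairs.length; i++) {
                if (i > 0) {
                    out.append('&'); // pairs are separated by '&'
                }
                // Step 1: escape name and value (space -> '+', others -> %HH).
                out.append(URLEncoder.encode(pairs[i][0], "UTF-8"));
                out.append('=');     // Step 2: name is separated from value by '='
                out.append(URLEncoder.encode(pairs[i][1], "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-8 not supported: " + e);
        }
        return out.toString();
    }
}
```

For the pairs {"q", "my query"} and {"note", "a+b&c"} this yields q=my+query&note=a%2Bb%26c.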
 
Mark Space

John said:
I am lost in a maze of twisty little standards, all different.

Yes well, the internet is a series of tubes.

Twisty little tubes, all different.
 
Mark Space

John said:
RFC 3986 discusses URI syntax. You might look at URI, in which 'The
space character, for example, is quoted by replacing it with "%20"':

http://java.sun.com/javase/6/docs/api/java/net/URI.html

Oh that's interesting. So URIs are the things that go inside other
documents, and URLs are the things that get sent to HTTP servers.
(Based on that HTML 4.0 spec you linked to earlier.)

I hope the OP knows when to use each one. MS browsers will interpret
URIs as URLs but most other browsers (Firefox) won't. It'd be a good
way to be incompatible with half of the internet if you used the wrong one.
 
Stefan Ram

Saju Pillai said:
The java.net.URLEncoder actually performs the older style of
form-encoding (space -> +). Is there a library api that performs the
newer style percentage encoding (space -> %20) ?

import java.net.URI;
import java.net.URISyntaxException;

public class Main {
    public static void main(final String[] args) throws URISyntaxException {
        // Only the query component is set; the constructor quotes illegal characters in it.
        final URI uri = new URI(null, null, null, -1, null, "a b+c", null);
        // toString() starts with the '?' that introduces the query; drop it.
        System.out.println(uri.toString().substring(1));
    }
}

a%20b+c

(I would have expected something like »a%20b%2Bc«. But maybe
the »%2B« is being replaced by »+« when »toString()« is active.)
 
Arne Vajhøj

Saju said:
The java.net.URLEncoder actually performs the older style of
form-encoding (space -> +). Is there a library api that performs the
newer style percentage encoding (space -> %20) ?

I am not aware of any standard API that does that.

The following has been seen:
URLEncoder.encode(s, "UTF-8").replaceAll("\\+", "%20")

Which I guess in newer Java versions would be:
URLEncoder.encode(s, "UTF-8").replace("+", "%20")
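
A small sketch wrapping that trick in a helper. The class and method names are hypothetical. The two extra replacements are an addition, not something from the snippet above: URLEncoder leaves '*' unescaped and escapes '~', whereas the RFC 3986 unreserved set is the other way round for those two characters. replaceAll is used because it is available on older JDKs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncoderFixupSketch {
    // Form-encodes with URLEncoder, then patches the result towards RFC 3986.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8")
                .replaceAll("\\+", "%20")  // a '+' in the output can only be an encoded space
                .replaceAll("\\*", "%2A")  // '*' is not unreserved in RFC 3986
                .replaceAll("%7E", "~");   // '~' is unreserved in RFC 3986
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-8 not supported: " + e);
        }
    }
}
```

Replacing '+' after encoding is safe because a literal '+' in the input has already become %2B by then; for example, "a b+c*~" comes out as a%20b%2Bc%2A~.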

Arne
 
Arne Vajhøj

Mark said:
I'm looking at the spec which John linked to, which uses the term "must"
when talking about + encoding. Python libs don't define any standards.
I'm not sure quoting python as if it's authoritative is a great idea.

Obviously not.

But it illustrates that the + versus %20 is a well known
problem.

Arne
 
Mark Space

Arne said:
I am not aware of any standard API that does that.

The following has been seen:
URLEncoder.encode(s, "UTF-8").replaceAll("\\+", "%20")

Which I guess in newer Java versions would be:
URLEncoder.encode(s, "UTF-8").replace("+", "%20")


As pointed out up thread (twice, now), the %20 thing is for URIs and the
+ is mandated by HTML 4.0 spec and that's what URL does. Use the URI
class to get the former.
 
John B. Matthews

Mark Space said:
Oh that's interesting. So URIs are the things that go inside other
documents, and URLs are the things that get sent to HTTP servers.
(Based on that HTML 4.0 spec you linked to earlier.)

I hope the OP knows when to use each one. MS browsers will interpret
URIs as URLs but most other browsers (Firefox) won't. It'd be a good
way to be incompatible with half of the internet if you used the
wrong one.

Good point. Firefox & Safari seem to recognize a few URI schemes like
file and ftp. Firefox can even do gopher.

The multi-argument URI constructors "quote illegal characters as
required by the components in which they appear." So, for example,

URI uri = new URI("file", "/Temporary Items/index.html", null);
System.out.print(uri.toString());

produces 'file:/Temporary%20Items/index.html', as the OP wanted.
 
Mark Space

John said:
Good point. Firefox & Safari seem to recognize a few URI schemes like
file and ftp. Firefox can even do gopher.

The multi-argument URI constructors "quote illegal characters as
required by the components in which they appear." So, for example,

URI uri = new URI("file", "/Temporary Items/index.html", null);
System.out.print(uri.toString());

produces 'file:/Temporary%20Items/index.html', as the OP wanted.

I've always been a little fuzzy on the difference between URLs and URIs,
but the document you quoted above seems to clear some things up. The
Java URL class follows the spec you need for returning FORM data to an
HTTP server. Most everything else uses URI, it seems.

I guess you might have to manually check the protocol. If it's HTTP,
use URL. Otherwise use URI.
 
Arne Vajhøj

Mark said:
As pointed out up thread (twice, now), the %20 thing is for URIs and the
+ is mandated by HTML 4.0 spec and that's what URL does. Use the URI
class to get the former.

URI has been mentioned, but I assumed it was meant as an
illustration of the + versus %20 problem not as a solution
to the question asked.

URI's behavior is significantly different from URLEncoder.

It may be that the original poster's usage
of URLEncoder was broken in the first place and that
URI is the solution.

But assuming that URLEncoder is working except for the
+ versus %20 problem, then URI will break the solution.
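
To illustrate how different the two behave, here is a rough sketch that feeds the same string to URLEncoder and to the query argument of the seven-argument URI constructor. The class and method names are invented for the example; only java.net.URLEncoder and java.net.URI are real.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

class UriVersusUrlEncoderSketch {
    // Form-encoding: escapes '&', '=' and '+', turns space into '+'.
    static String viaUrlEncoder(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-8 not supported: " + e);
        }
    }

    // URI quoting: only characters illegal in a query are escaped,
    // so '&', '=' and '+' pass through untouched.
    static String viaUriQuery(String s) {
        try {
            return new URI(null, null, null, -1, null, s, null).getRawQuery();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("not usable as a query: " + e);
        }
    }
}
```

For the input "a&b=c d+e", viaUrlEncoder gives a%26b%3Dc+d%2Be while viaUriQuery gives a&b=c%20d+e, so code that relied on URLEncoder escaping '&' or '=' inside a value would break if it switched to URI.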

Arne
 
Tom Anderson

Good point. Firefox & Safari seem to recognize a few URI schemes like
file and ftp. Firefox can even do gopher.

I think some people are getting pretty confused here. file and ftp are URL
schemes. With an 'L'. ftp://sun.com/pub is a URL. It's right there in RFC
1738. But URLs - all of them - are also URIs. URIs are a superset of URLs.
It's not about whether they go in documents or are sent over HTTP.

There's another kind of URI that is not a URL, and that's a URN - a
Uniform Resource Name. URNs are funny little things that don't get seen
very often; they look like urn:scheme:path, with a conspicuous absence of
slashes, at least at the start. The big idea is that a URN is a name
rather than a location - it uniquely identifies something, but it doesn't
tell you how to find it. URNs are split up into different namespaces,
which are different ways of identifying elements of different sets of
things. For example, one of the URN namespaces is ISBN, for books, so
urn:isbn:978-1594743344 identifies a book - but doesn't immediately help
you find it. You'd have to go to some kind of resolver service to map it
to a URL which you could actually use - and that's the idea, since it
provides a layer of indirection which decouples identity of an object,
which is eternal, from the means to access it, which is transient.

The raison d'etre of the concept of a URI is merely to unify URLs and
URNs.

At least, that's how it started out. See RFC 3305 for meditations on
meaning and taxonomy.

The rules for URIs, URLs and URNs are in agreement on escaping: it's done
with percent signs. Encoding spaces as pluses is not part of those
specifications.

Rather, the plus for space thing is part of the specification of the
application/x-www-form-urlencoded content type. This is a content type
which encodes a list of key-value pairs (where keys and values, or at
least values, can be arbitrary byte strings) as text comprising characters
from a limited subset of ASCII. It's like a cross between java's
properties file format and base-64. Anyway, it says to encode spaces as
pluses. The purpose of x-www-form-urlencoded is to encode the values in an
HTML form in such a way that they can be transmitted as part of a URL (or
as an entity body in a POST, but that's less of a driver): an
x-www-form-urlencoded string is safe to use as the query part of a URL.

To clarify, when you see a URL like:

http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search

There are *two* *different* layers of syntax here. First is the URI/URL
syntax, which breaks the string down to:

Scheme: http
Authority: www.google.co.uk
Path: search
Query: hl=en&safe=off&q=my+query&btnG=Search

Second is x-www-form-urlencoding of the query part, which breaks it down
to:

hl: en
safe: off
q: my query
btnG: Search
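
The two layers can be sketched in Java along these lines: java.net.URI handles the first layer, and a manual split plus java.net.URLDecoder handles the second. The class TwoLayerParseSketch and its formFields method are hypothetical names, UTF-8 is assumed, and generics are used, so this needs a newer JDK than 1.4.2.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class TwoLayerParseSketch {
    static Map<String, String> formFields(String url) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        try {
            // Layer 1: URI syntax. getRawQuery() returns the query undecoded.
            String rawQuery = new URI(url).getRawQuery();
            if (rawQuery == null) {
                return fields;
            }
            // Layer 2: x-www-form-urlencoded. Split first, decode afterwards.
            String[] pairs = rawQuery.split("&");
            for (int i = 0; i < pairs.length; i++) {
                int eq = pairs[i].indexOf('=');
                String name = eq < 0 ? pairs[i] : pairs[i].substring(0, eq);
                String value = eq < 0 ? "" : pairs[i].substring(eq + 1);
                fields.put(URLDecoder.decode(name, "UTF-8"),
                           URLDecoder.decode(value, "UTF-8"));
            }
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("not a valid URI: " + e);
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-8 not supported: " + e);
        }
        return fields;
    }
}
```

Splitting on '&' and '=' before decoding matters: decoding first would turn an escaped %26 inside a value into a separator.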

Note that it is permitted to have raw + signs in the query part: they're
reserved characters in URI syntax, but in the lesser 'subcomponent
delimiter' set, rather than the greater 'generic delimiter' set, and that
means that they can be used unescaped in a part, provided that the syntax
for that part permits it. I can't find anything in a specification of the
http URL scheme that forbids + from the query part, and thus, applying
ancient Anglo-Saxon legal principles, it's permitted. If you don't like
it, you can always escape them:

http://www.google.co.uk/search?hl=en&safe=off&q=my%2Bquery&btnG=Search

I sincerely believe that that URL is exactly equivalent to the one above.
Although i note that Google doesn't think so. Hmm.
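
A possible reason the two forms are not treated alike: under form decoding, a raw '+' in the query means a space, whereas an escaped plus (%2B) means a literal plus sign. A small sketch, using java.net.URLDecoder as a stand-in for whatever a server does (that Google behaves this way is an assumption); the class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class PlusVersusEscapedPlusSketch {
    // Decodes one x-www-form-urlencoded value, assuming UTF-8.
    static String formDecode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-8 not supported: " + e);
        }
    }
}
```

Here formDecode("my+query") returns "my query", while formDecode("my%2Bquery") returns "my+query". The two are interchangeable at the URI layer only if nothing further up gives '+' a meaning of its own.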

Finally, note that x-www-form-urlencoded only applies to the query part,
and only to queries which specifically use it as an encoding (it's the
default in HTML, which is why you see it so often). That means that this:

http://example.org/plus+path

is *not* x-www-form-urlencoded, and is *not* equivalent to:

http://example.org/plus%20path

Both of those are legal (i think) and different to each other.
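
A short sketch of that difference using java.net.URI, which decodes percent escapes in the path but leaves a plus sign alone. The class and method names are invented for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PathPlusSketch {
    // Returns the decoded path component of the given URL.
    static String decodedPath(String url) {
        try {
            return new URI(url).getPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("not a valid URI: " + e);
        }
    }
}
```

decodedPath("http://example.org/plus+path") returns "/plus+path", while decodedPath("http://example.org/plus%20path") returns "/plus path".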

tom
 
Mark Space

Tom said:
I think some people are getting pretty confused here. file and ftp are
URL schemes. With an 'L'. ftp://sun.com/pub is a URL. It's right there
in RFC 1738. But URLs - all of them - are also URIs. URIs are a superset
of URLs. It's not about whether they go in documents or are sent over HTTP.

Good points. I think it's exactly what the HTML 4 spec says: in
response to a HTML form, the URL of the posted response must be
x-www-form-urlencoded.



The rules for URIs, URLs and URNs are in agreement on escaping: it's done
with percent signs. Encoding spaces as pluses is not part of those
specifications.

Fair enough.

Rather, the plus for space thing is part of the specification of the
application/x-www-form-urlencoded content type. This is a content type

Yup. If you're a user agent, replying to an HTML form (or sending a
POST to a server that expects a reply from a user agent interpreting
HTML), use the x-www-form-urlencoded form.

Otherwise... consult the relevant spec, because URI encoding alone won't
tell you what to do.

Note that it is permitted to have raw + signs in the query part: they're

Actually, nearly all parts allow the + sign in a URI. It's the other
specs that say you can't.

Thus, while the rules of URIs say this is legal: http://www.fu+bar.com
the rules of hostnames (not going to look up the RFC right now) says
"not!" So it's forbidden.

This is what I understand right now. URIs are just a general sort of
spec, there always seems to be a more specific one that applies also.

http://www.google.co.uk/search?hl=en&safe=off&q=my%2Bquery&btnG=Search

I sincerely believe that that URL is exactly equivalent to the one
above. Although i note that Google doesn't think so. Hmm.

Yeah, no go, because Google is expecting an HTML reply from a browser.
Thus, they bounce your encoding because it's not the form dictated in
the HTML 4.0 spec. Thus: valid URI, invalid HTML.

Finally, note that x-www-form-urlencoded only applies to the query part,
and only to queries which specifically use it as an encoding (it's the
default in HTML, which is why you see it so often). That means that this:

Hmm, might be a consequence of HTML 4.0 spec, but not in general. Other
schemes could use the + sign differently and in different parts of the
URI too.

http://example.org/plus+path

is *not* x-www-form-urlencoded, and is *not* equivalent to:

http://example.org/plus%20path

Both of those are legal (i think) and different to each other.


Legal: yes. Different: yes. One's gots a space, the other a plus. If
example.org isn't running a web server but instead say something like
SOAP or some custom program, it's up to that program/spec to interpret
the string however it needs to. I'm kinda suspicious of the HTTP part
though, I think it might not be legal to use that scheme for your own
purposes, strictly speaking.

Still, if we're not talking FORM response here, then I think it might be
legal for a web server to apply various interpretations to the two above
URLs.
 
Roedy Green

The java.net.URLEncoder actually performs the older style of
form-encoding (space -> +). Is there a library api that performs the
newer style percentage encoding (space -> %20) ?

Sun has two url-encoding mechanisms using the URI and the URLEncoder
classes. see http://mindprod.com/jgloss/urlencoded.html for details.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Here is a point of no return after which warming becomes unstoppable
and we are probably going to sail right through it.
It is the point at which anthropogenic (human-caused) warming triggers
huge releases of carbon dioxide from warming oceans, or similar releases
of both carbon dioxide and methane from melting permafrost, or both.
Most climate scientists think that point lies not far beyond 2°C (4°F) C hotter."
~ Gwynne Dyer
 
