Ahhh.. URL wants to get encoded. Does Java wanna?

  • Thread starter François

François

Now, from the docs for the URL class:

"Note, the URI class does perform escaping of its component fields in
certain circumstances. The recommended way to manage the encoding and
decoding of URLs is to use URI, and to convert between these two
classes using toURI() and URI.toURL().

The URLEncoder and URLDecoder classes can also be used, but only for
HTML form encoding, which is not the same as the encoding scheme
defined in RFC2396."


I don't care about form encoding. Just want to encode a string into a
readable URL (RFC2396: thanks Berners-Lee, you described the web. The
implementation is rough around the edges, sometimes).

So, from what I understand of the mentioned Java 5.0 docs, I wrote
this Java line, nonsensical, but doc-abiding:

encodedURL = (new URL(urlString)).toURI().toURL();

If I want to get the encoded URL, URL-ready:

encodedURL.toString();

Ahhh... the original string, not encoded at all.

Please help, before I go back to PHP and cry sour tears:

echo (urlencode('http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&myworkhorse=PHP'));

Thanks, fellow coders
 
François

Ohhh: even Google newsgroup parser understands it:

& = &amp;

I did it! I encoded the ampersand (by hand, mind you). (Sheer
brainpower)
 

Wayne

Roedy said:

Roedy,

I just tried using URI, it doesn't seem to escape/encode
an ampersand in any part of the URI. Also, what about the
new IRIs? A Java program should be robust enough to
handle legal URLs/URIs/IRIs, converting the (up to)
nine parts of an IRI correctly. My understanding of
your (excellent) urlencoded page and the API docs means this:

URI uri = new URI("http", "//www.example.com/you & I 10%? wierd & wierder", null);
System.out.println( uri.toURL() );

should produce:
http://www.example.com/you & I 10%? wierd & wierder
But it produces:
http://www.example.com/you & I 10%? wierd & wierder

(The ampersand is not encoded.) What did I do wrong?

-Wayne
 

Roedy Green

URI uri = new URI("http", "//www.example.com/you & I 10%? wierd & wierder", null);
System.out.println( uri.toURL() );

The way I read RFC 2396 is that the reserved chars:
; / ? : @ & = + $ ,
are not supposed to be escaped. Perhaps Patricia could read the RFC
and tell us what it really means.

I wish the people who write RFCs would provide examples to illustrate
the true meaning of the lawyerese.
 

Wayne

I guess the answer is to encode the query part separately, if needed.
The following code seems to work:

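import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLEncoder;

public String encodeURL( String initialURL, boolean parseQuery )
    throws MalformedURLException, URISyntaxException, UnsupportedEncodingException
{
    // Parse the URL (without encoding):
    URL url = new URL( initialURL );
    String scheme = url.getProtocol();     // E.g., "http"
    String authority = url.getAuthority(); // E.g., "user@host:port"
    String path = url.getPath();           // E.g., "/foo/bar.htm"
    String query = url.getQuery();         // E.g., "foo=bar" (no leading '?'); null if absent
    if ( parseQuery && query != null )
        query = URLEncoder.encode( query, "UTF-8" );
    String fragment = url.getRef();        // I.e., the "anchor"

    // Assemble the encoded URL, using URI class to properly
    // encode each part. Note that this constructor always quotes '%',
    // so the escapes produced by URLEncoder above are escaped a second
    // time (e.g., "%26" becomes "%2526").
    URI uri = new URI( scheme, authority, path, query, fragment );
    return uri.toString();
}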

-Wayne
 

Steven Simpson

François said:
encodedURL = (new URL(urlString)).toURI().toURL();

If I want to get the encoded URL, URL-ready:

encodedURL.toString();

Ahhh... the original string, not encoded at all.

Please help, before, I go back to PHP and cry sour tears:

echo (urlencode('http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&myworkhorse=PHP'));
http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&amp;mywife=PHP

The conversion from "&" to "&amp;" is not relevant to URI encoding - it
is HTML-encoding (and XML, etc). java.net.URI has no knowledge of this,
and should not have. It does not know whether you're going to put the
result into an HTML file or something else.

If you're writing out any literal text as part of HTML, including a URI
with "&" in it, you independently need an encoding to map "&" to
"&amp;", "<" to "&lt;", etc.
 
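A minimal sketch of that separate markup-escaping step. The class name XmlEscapeSketch and the helper escapeXml are made up for illustration; they are not part of java.net or any other JDK package. The entity &apos; is defined for XML but not for HTML 4, so treat that case as an XML-only assumption.

```java
// Hypothetical helper (not a JDK API): escapes the five characters
// that carry markup meaning in XML before text is written to a document.
final class XmlEscapeSketch {

    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The URI is already a finished URI; only the markup layer is applied here.
        String uri = "http://www.javahhh.com/somephpfile.php?myunrequisitedlove=Java&myworkhorse=PHP";
        System.out.println(escapeXml(uri));
    }
}
```

The URI itself is untouched by this step; only the "&" that would confuse an XML or HTML parser is rewritten.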

Sabine Dinis Blochberger

What I understood the correct way to be is: encode the parameter part
of your URL (after the '?') with java.net.URLEncoder.encode(),
specifying UTF-8.
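Roughly like this, as a sketch only. The class QueryEncodingSketch, the helper encodeParam, and the example.com URL with its "title" parameter are all invented for illustration. Each parameter name and value is encoded on its own, so the '=' and '&' that structure the query stay literal.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

final class QueryEncodingSketch {

    // Hypothetical helper: form-encodes a single parameter name or value as UTF-8.
    static String encodeParam(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Encode name and value separately, then join them with a literal '='.
        String query = encodeParam("title") + "=" + encodeParam("you & I");

        // URI.create only parses the string; it does not escape it again.
        URI uri = URI.create("http://www.example.com/search?" + query);
        System.out.println(uri); // http://www.example.com/search?title=you+%26+I
    }
}
```

Passing the whole query string through URLEncoder in one go would also encode the '=' and '&' separators, which is usually not what you want.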
 
François

Fellow Coders:

Thanks very much. Read all the fine answers, which put me on the
right track.

The simple goal is to produce a sitemap.xml, with the url properly
encoded (as specified at http://sitemaps.org/protocol.php).

An example, taken from sitemaps.org:

http://www.example.com/ümlat.php&q=name should become
http://www.example.com/%C3%BCmlat.php&amp;q=name

(In case the Google Groups parser mangles the above: ü => %C3%BC
and & => &amp;)

The code snippet posted by Wayne works ok, but the ümlaut stays an
ümlaut, since it is not part of the query.

And we can't URLEncode the whole string, since forward slashes and
other valid characters would be transformed into %XX escapes.

The brute force way to do it (and the only way I found could work with
the sitemaps.org example) is to take the initial string, parse every
single char, replace <, >, & and " with their escaped version (&amp;,
&lt;, etc.) as Steven indicated, and finally test any remaining char
to be within the range \u0000 to \u007F (the Basic Latin block) and
encode any char outside that range with this class, taken straight out
of the W3C website (http://www.w3.org/International/O-URL-code.html):


/**********************************************************/

public class URLUTF8Encoder
{

    final static String[] hex = {
        "%00", "%01", "%02", "%03", "%04", "%05", "%06", "%07",
        "%08", "%09", "%0a", "%0b", "%0c", "%0d", "%0e", "%0f",
        "%10", "%11", "%12", "%13", "%14", "%15", "%16", "%17",
        "%18", "%19", "%1a", "%1b", "%1c", "%1d", "%1e", "%1f",
        "%20", "%21", "%22", "%23", "%24", "%25", "%26", "%27",
        "%28", "%29", "%2a", "%2b", "%2c", "%2d", "%2e", "%2f",
        "%30", "%31", "%32", "%33", "%34", "%35", "%36", "%37",
        "%38", "%39", "%3a", "%3b", "%3c", "%3d", "%3e", "%3f",
        "%40", "%41", "%42", "%43", "%44", "%45", "%46", "%47",
        "%48", "%49", "%4a", "%4b", "%4c", "%4d", "%4e", "%4f",
        "%50", "%51", "%52", "%53", "%54", "%55", "%56", "%57",
        "%58", "%59", "%5a", "%5b", "%5c", "%5d", "%5e", "%5f",
        "%60", "%61", "%62", "%63", "%64", "%65", "%66", "%67",
        "%68", "%69", "%6a", "%6b", "%6c", "%6d", "%6e", "%6f",
        "%70", "%71", "%72", "%73", "%74", "%75", "%76", "%77",
        "%78", "%79", "%7a", "%7b", "%7c", "%7d", "%7e", "%7f",
        "%80", "%81", "%82", "%83", "%84", "%85", "%86", "%87",
        "%88", "%89", "%8a", "%8b", "%8c", "%8d", "%8e", "%8f",
        "%90", "%91", "%92", "%93", "%94", "%95", "%96", "%97",
        "%98", "%99", "%9a", "%9b", "%9c", "%9d", "%9e", "%9f",
        "%a0", "%a1", "%a2", "%a3", "%a4", "%a5", "%a6", "%a7",
        "%a8", "%a9", "%aa", "%ab", "%ac", "%ad", "%ae", "%af",
        "%b0", "%b1", "%b2", "%b3", "%b4", "%b5", "%b6", "%b7",
        "%b8", "%b9", "%ba", "%bb", "%bc", "%bd", "%be", "%bf",
        "%c0", "%c1", "%c2", "%c3", "%c4", "%c5", "%c6", "%c7",
        "%c8", "%c9", "%ca", "%cb", "%cc", "%cd", "%ce", "%cf",
        "%d0", "%d1", "%d2", "%d3", "%d4", "%d5", "%d6", "%d7",
        "%d8", "%d9", "%da", "%db", "%dc", "%dd", "%de", "%df",
        "%e0", "%e1", "%e2", "%e3", "%e4", "%e5", "%e6", "%e7",
        "%e8", "%e9", "%ea", "%eb", "%ec", "%ed", "%ee", "%ef",
        "%f0", "%f1", "%f2", "%f3", "%f4", "%f5", "%f6", "%f7",
        "%f8", "%f9", "%fa", "%fb", "%fc", "%fd", "%fe", "%ff"
    };

    /**
     * Encode a string to the "x-www-form-urlencoded" form, enhanced
     * with the UTF-8-in-URL proposal. This is what happens:
     *
     * <ul>
     * <li><p>The ASCII characters 'a' through 'z', 'A' through 'Z',
     * and '0' through '9' remain the same.
     *
     * <li><p>The unreserved characters - _ . ! ~ * ' ( ) remain the
     * same.
     *
     * <li><p>The space character ' ' is converted into a plus sign '+'.
     *
     * <li><p>All other ASCII characters are converted into the
     * 3-character string "%xy", where xy is
     * the two-digit hexadecimal representation of the character
     * code
     *
     * <li><p>All non-ASCII characters are encoded in two steps: first
     * to a sequence of 2 or 3 bytes, using the UTF-8 algorithm;
     * secondly each of these bytes is encoded as "%xx".
     * </ul>
     *
     * @param s The string to be encoded
     * @return The encoded string
     */
    public static String encode(String s)
    {
        StringBuffer sbuf = new StringBuffer();
        int len = s.length();
        for (int i = 0; i < len; i++) {
            int ch = s.charAt(i);
            if ('A' <= ch && ch <= 'Z') {          // 'A'..'Z'
                sbuf.append((char)ch);
            } else if ('a' <= ch && ch <= 'z') {   // 'a'..'z'
                sbuf.append((char)ch);
            } else if ('0' <= ch && ch <= '9') {   // '0'..'9'
                sbuf.append((char)ch);
            } else if (ch == ' ') {                // space
                sbuf.append('+');
            } else if (ch == '-' || ch == '_'      // unreserved
                    || ch == '.' || ch == '!'
                    || ch == '~' || ch == '*'
                    || ch == '\'' || ch == '('
                    || ch == ')') {
                sbuf.append((char)ch);
            } else if (ch <= 0x007f) {             // other ASCII
                sbuf.append(hex[ch]);
            } else if (ch <= 0x07FF) {             // non-ASCII <= 0x7FF
                sbuf.append(hex[0xc0 | (ch >> 6)]);
                sbuf.append(hex[0x80 | (ch & 0x3F)]);
            } else {                               // 0x7FF < ch <= 0xFFFF
                sbuf.append(hex[0xe0 | (ch >> 12)]);
                sbuf.append(hex[0x80 | ((ch >> 6) & 0x3F)]);
                sbuf.append(hex[0x80 | (ch & 0x3F)]);
            }
        }
        return sbuf.toString();
    }

}


/**********************************************************/


Are we having fun yet? In this particular case (a very common case),
1 line of PHP equals over 100 lines of Java. My KLOC just went
through the roof and my employer suggested I take a very long
vacation, in some remote location.

Thanks again all.
 

Daniel Pitts

François said:
Are we having fun yet? In this particular case (a very common case),
1 line of PHP equals over 100 lines of Java. My KLOC just went
through the roof and my employer suggested I take a very long
vacation, in some remote location.

I would bet that you could find an Apache Commons API that does this for
you!

Also, if you were using JSP technology, you could use <c:out
value="${url}" escapeXml="true" />. So comparing Java code with PHP
doesn't make sense. PHP is /designed/ for HTML templates; Java is not.
JSP is, so it has the same functionality that you expect from PHP.

Hope this helps,
Daniel.
 

Steven Simpson

François said:
An example, taken from sitemaps.org:

http://www.example.com/ümlat.php&q=name should become
http://www.example.com/%C3%BCmlat.php&amp;q=name

The brute force way to do it (and the only way I found could work with
the sitemaps.org example) is to take the initial string, parse every
single char, replace <, >, & and " with their escaped version (&amp;,
&lt;, etc.) as Steven indicated, and finally test any remaining char
to be within the range \u0000 to \u007F (the Basic Latin block) and
encode any char outside that range with this class, taken straight out
of the W3C website (http://www.w3.org/International/O-URL-code.html):

You should really do the %-encoding first, then the &;-encoding, for
symmetry with parsing at the other end, where it will be &;-decoded
first, then %-decoded.

public class URLUTF8Encoder

/**
* Encode a string to the "x-www-form-urlencoded" form, enhanced
* with the UTF-8-in-URL proposal. This is what happens:

* <li><p>The space character ' ' is converted into a plus sign '+'.
*

I think this is too much at this stage. This space->plus conversion,
and the corresponding '+'->'%2B' conversion, must already have been done
in order to form the URI; you can't apply it to a complete URI, because
the pluses that started out as spaces would then become %2B.

If you already have the URI, all you're doing now is making it
ASCII-compatible.

Are we having fun yet? In this particular case (a very common case),
1 line of PHP equals over 100 lines of Java.

No, I've just noticed URI.toASCIIString()!
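A quick sketch of what that looks like, using the sitemaps.org example quoted earlier. The class name AsciiUriSketch and the helper toAscii are invented for illustration; URI.create and toASCIIString are the real java.net.URI methods. This assumes the input is already a well-formed URI apart from its non-ASCII characters.

```java
import java.net.URI;

final class AsciiUriSketch {

    // Hypothetical helper: parses a URI string and returns its US-ASCII form,
    // with each non-ASCII character replaced by its percent-encoded UTF-8 bytes.
    static String toAscii(String uri) {
        return URI.create(uri).toASCIIString();
    }

    public static void main(String[] args) {
        // \u00FC is the u-umlaut, written as an escape to avoid source-encoding issues.
        System.out.println(toAscii("http://www.example.com/\u00FCmlat.php&q=name"));
        // prints http://www.example.com/%C3%BCmlat.php&q=name
    }
}
```

The "&" is still a plain "&" at this point; turning it into "&amp;" remains a separate step when the result is written into XML. Note that URI.create rejects strings containing characters such as spaces, so those have to be dealt with when the URI is formed.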
 

Owen Jacobson

the way I read RFC 2396 is that reserved chars:
; / ? : @ & = + $ ,
are not supposed to be escaped. Perhaps Patricia could read the RFC
and tell us what it really means.

The character & is used in URLs and URIs to separate parts of the
query, in which case it should be present as an actual & character.
It can also occur inside query parameter names or values, in which
case it should be present in encoded form, as the string %26. The
example URI Wayne gave uses (unintentionally) ampersands as query
separators, which is why the URI class isn't escaping them; if he
wants to use them as part of the path or part of query parameters or
values he'll have to encode them himself with
URLEncoder.encode(String, String) or similar.
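To see what java.net.URI itself does with a raw "&", here is a small check. It is only a sketch: the class name AmpersandSketch, the helper build, and the example.com components are made up for illustration. It uses the five-argument URI constructor, which quotes characters that are illegal in a component (a space, a literal '%') but leaves '&' alone.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class AmpersandSketch {

    // Hypothetical helper: builds an http URI on www.example.com from a raw
    // (unencoded) path and query, letting the URI constructor do the quoting.
    static String build(String path, String query) {
        try {
            return new URI("http", "www.example.com", path, query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/you & I 10%", "weird & weirder"));
        // prints http://www.example.com/you%20&%20I%2010%25?weird%20&%20weirder
    }
}
```

Spaces become %20 and the '%' becomes %25, while every "&" comes through unchanged, so an ampersand that is meant as data has to be turned into %26 before the URI is assembled.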

Elsewhere within a URI, ampersands are not reserved and do not
require encoding, except in the scheme part, where they're simply
illegal.

I wish the people who write RFCs would provide examples to illustrate
the true meaning of the lawyerese.

The RFCs tend to be a codification of existing practice, rather than a
prescription. In the case of the URI RFC it's a little vaguer, since
URIs (that are not also URLs) aren't in terribly widespread use and
came about as an attempt to normalize URLs, so the RFC could be seen
as prescriptive rather than informative.

On the whole, this post-hoc RFC process works well: it gives the
people creating prototypes time and freedom to play with ideas and
discard the bad ones without prematurely codifying them in a
standard. It's not perfect, but then, what is?

-Owen
 
