Sanitizing HTML strings.

jason.cipriani

Is there anything in the Java API that will sanitize strings for
display in HTML (e.g. replace HTML tokens with escape sequences), or
is it normal to roll your own? I'm playing around with Java, and Java
servlets; I'm primarily a C++ programmer and not too familiar with
Java, so sorry if this is a silly question.

Thanks,
Jason
 
Stefan Ram

Is there anything in the Java API that will sanitize strings for
display in HTML (e.g. replace HTML tokens with escape sequences), or
is it normal to roll your own?

Some people suggest:

http://commons.apache.org/lang/api/...EscapeUtils.html#escapeHtml(java.lang.String)

But I found defects in it some years ago:

import org.apache.commons.lang.StringEscapeUtils;

public final class Test
{
    public static void main( final String[] args )
    {
        System.out.println( StringEscapeUtils.escapeXml( "a" ) );
        System.out.println( StringEscapeUtils.escapeXml( "ä" ) );
        System.out.println( StringEscapeUtils.escapeXml( "&" ) );
        final String text = "\ud800\udc00";
        System.out.println( text.codePointCount( 0, text.length() ) );
        System.out.println( StringEscapeUtils.escapeXml( text ) );
    }
}

IIRC, this should show that characters other than those with
special meaning in HTML are replaced too (which may or may not
be wanted), and that code points represented by surrogate pairs
are not handled correctly.
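
To make the surrogate-pair point concrete, here is a minimal sketch. It is my own illustration and not code from Commons Lang; the class and method names are hypothetical. It builds numeric character references once per UTF-16 char and once per code point, so the difference is visible for U+10000:

```java
final class SurrogateDemo
{
    /** One reference per UTF-16 char: wrong for surrogate pairs. */
    static String perChar( final String text )
    {
        final StringBuilder out = new StringBuilder();
        for( int i = 0; i < text.length(); ++i )
            out.append( "&#" ).append( (int) text.charAt( i ) ).append( ';' );
        return out.toString();
    }

    /** One reference per code point: a surrogate pair becomes a single reference. */
    static String perCodePoint( final String text )
    {
        final StringBuilder out = new StringBuilder();
        for( int i = 0; i < text.length(); )
        {
            final int cp = text.codePointAt( i );
            out.append( "&#" ).append( cp ).append( ';' );
            i += Character.charCount( cp ); // 2 for a surrogate pair, else 1
        }
        return out.toString();
    }

    public static void main( final String[] args )
    {
        final String text = "\ud800\udc00"; // U+10000 as a surrogate pair
        System.out.println( perChar( text ) );      // &#55296;&#56320;
        System.out.println( perCodePoint( text ) ); // &#65536;
    }
}
```

The first form refers to two surrogate code points, which are not valid characters in HTML or XML; only the second form names the intended character.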

The last time I looked, the JDK had some methods for this, but
most of them were private, protected or not intended for use
by applications:

javax.swing.text.html.HTMLWriter#output(char[] chars, int start, int length)
java.util.logging.XMLFormatter#escape(StringBuffer sb, String text)
java.beans.XMLEncoder#quoteCharacters(String s)
com.sun.org.apache.xml.internal.serialize.XMLSerializer#printEscaped(String source)
com.sun.org.apache.xml.internal.serialize.XML11Serializer#printEscaped(String source)
com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDAbstractTraverser#processAttValue(String original)

"package":

com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOM#processAttValue(String original)

"public":

com.sun.org.apache.xalan.internal.client.XSLTProcessorApplet#escapeString(String s)

So, you might try an approach like the one in the following class I wrote.

public final class Text
{
    /* Returns a numeric character reference for a character with special
       meaning in markup, or null if the character needs no escaping. */
    public static java.lang.String sourceCharacter( final char s )
    {
        return
        ( s < 63 && s >= 34 ) ?
            ( s < 40 ?
                ( s == '"'  ? "&#34;" :
                  s == '&'  ? "&#38;" :
                  s == '\'' ? "&#39;" : null ) :
              s >= 60 ?
                ( s == '<' ? "&#60;" :
                  s == '>' ? "&#62;" : null ) : null ) : null;
    }

    public static java.lang.String sourceText( final java.lang.String text )
    {
        java.lang.StringBuilder buffer = null; /* allocated on first replacement */
        int growth = 0; /* how much longer the buffer is than the text so far */
        final int length = text.length();
        for( int i = 0; i < length; ++i )
        {
            final java.lang.String sourceChar = sourceCharacter( text.charAt( i ) );
            if( sourceChar != null )
            {
                if( buffer == null ) buffer = new java.lang.StringBuilder( text );
                final int position = i + growth;
                buffer.replace( position, position + 1, sourceChar );
                /* one char was replaced by sourceChar.length() chars */
                growth += sourceChar.length() - 1;
            }
        }
        return buffer == null ? text : buffer.toString();
    }

    /* untested */
    public static void main( final String[] args )
    {
        java.lang.System.out.println
        ( sourceText( "<alpha beta=\"gamma\" delta='epsilon' />" ));
    }
}

This is intended only to encode characters with special
meanings. Depending on the encoding used for the HTML
document, other characters might have to be represented using
character references, too.
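
As a rough sketch of that last point: the hypothetical class below (the name and the method signature are my own assumptions, not an existing API) escapes the five markup characters and, in addition, any code point that the document's charset cannot encode. It relies on java.nio.charset.CharsetEncoder#canEncode and walks the text by code point, so surrogate pairs yield a single reference.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class CharsetAwareEscaper
{
    /** Escapes markup characters, plus any code point the charset cannot encode. */
    static String escape( final String text, final Charset charset )
    {
        final CharsetEncoder encoder = charset.newEncoder();
        final StringBuilder out = new StringBuilder( text.length() + 16 );
        for( int i = 0; i < text.length(); )
        {
            final int cp = text.codePointAt( i );
            final int width = Character.charCount( cp );
            final boolean markup =
                cp == '"' || cp == '&' || cp == '\'' || cp == '<' || cp == '>';
            if( markup || !encoder.canEncode( text.subSequence( i, i + width ) ) )
                out.append( "&#" ).append( cp ).append( ';' ); // numeric reference
            else
                out.appendCodePoint( cp );                     // copy unchanged
            i += width;
        }
        return out.toString();
    }

    public static void main( final String[] args )
    {
        // \u00e4 is "a with diaeresis"; US-ASCII cannot encode it, UTF-8 can.
        System.out.println( escape( "<\u00e4>", StandardCharsets.US_ASCII ) ); // &#60;&#228;&#62;
        System.out.println( escape( "<\u00e4>", StandardCharsets.UTF_8 ) );
    }
}
```

With a UTF-8 document only the markup characters end up as references, which is probably what most servlets want today.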

Another idea might be to use a
javax.swing.text.html.HTMLEditorKit to write the text into a
document and then serialize it to HTML, but I have not tried
this.
 
Mark Space

Is there anything in the Java API that will sanitize strings for
display in HTML (e.g. replace HTML tokens with escape sequences), or
is it normal to roll your own? I'm playing around with Java, and Java
servlets; I'm primarily a C++ programmer and not too familiar with
Java, so sorry if this is a silly question.

I think you can get by with just replacing "&" with "&amp;" and "<" with
"&lt;". This is really an HTML/XML question, not Java per se. The
String class has a replaceAll() method you can use.
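
A minimal sketch of that suggestion follows; the class name is hypothetical. It uses String.replace(CharSequence, CharSequence) rather than replaceAll(), since no regular expression is needed, and it replaces "&" first so that the ampersands introduced by the second step are not escaped again. Note that this covers element content only; text placed inside attribute values would also need its quote characters escaped.

```java
final class MinimalEscape
{
    /** Escapes '&' and '<' only; '&' must be handled first. */
    static String escape( final String text )
    {
        return text.replace( "&", "&amp;" )
                   .replace( "<", "&lt;" );
    }

    public static void main( final String[] args )
    {
        // prints: if (a &lt; b &amp;&amp; b &lt; c)
        System.out.println( escape( "if (a < b && b < c)" ) );
    }
}
```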

If you're talking about data validation in general, that's a whole
'nother ball of wax. Be careful about assuming that all you need to do
is "sanitize HTML." I'm not an expert, but it would be wise to become
one before designing a strategy to validate input, especially input
taken from a web form.

This is really too complicated for the Java API. You might try
libraries designed to interact with users on the web; there may be
sub-libraries designed with various validators in mind. Struts is
probably the oldest, and the Struts home page has links to other
web-oriented UI frameworks that might be useful, under Similar Projects.

http://struts.apache.org/
 
Roedy Green

Is there anything in the Java API that will sanitize strings for
display in HTML (e.g. replace HTML tokens with escape sequences), or
is it normal to roll your own? I'm playing around with Java, and Java
servlets; I'm primarily a C++ programmer and not too familiar with
Java, so sorry if this is a silly question.

I have written some stuff that might prove useful.

http://mindprod.com/products1.html#ENTITIES

which interconverts between Unicode and &xxx; entities.
There is also a method to strip out HTML tags, leaving you just the raw
text.

http://mindprod.com/products1.html#AMPER
which converts & to &amp; where appropriate in a malformed HTML
document.

If your HTML is well-formed, you can render it inside Java JLabels and
JTextAreas. See http://mindprod.com/jgloss/htmlrendering.html

There is also a utility, http://mindprod.com/applet/quoter.html,
that will transform data in many different ways, including converting
HTML to Java string literals.
 
