Sanitizing HTML strings.

Discussion in 'Java' started by jason.cipriani@gmail.com, Jun 19, 2008.

  1. Guest

    Is there anything in the Java API that will sanitize strings for
    display in HTML (e.g. replace HTML tokens with escape sequences), or
    is it normal to roll your own? I'm playing around with Java, and Java
    servlets; I'm primarily a C++ programmer and not too familiar with
    Java, so sorry if this is a silly question.

    Thanks,
    Jason
     
    , Jun 19, 2008
    #1

  2. Stefan Ram Guest

    "" <> writes:
    >Is there anything in the Java API that will sanitize strings for
    >display in HTML (e.g. replace HTML tokens with escape sequences), or
    >is it normal to roll your own?


    Some people suggest:

    http://commons.apache.org/lang/api/...gEscapeUtils.html#escapeHtml(java.lang.String)

    But I found defects in it some years ago:

    import org.apache.commons.lang.StringEscapeUtils;

    public final class Test
    {
      public static void main( final String[] args )
      {
        System.out.println( StringEscapeUtils.escapeXml( "a" ) );
        System.out.println( StringEscapeUtils.escapeXml( "ä" ) );
        System.out.println( StringEscapeUtils.escapeXml( "&" ) );
        /* U+10000, a single code point stored as a surrogate pair */
        final String text = "\ud800\udc00";
        System.out.println( text.codePointCount( 0, text.length() ) );
        System.out.println( StringEscapeUtils.escapeXml( text ) );
      }
    }

    IIRC, this should show two things: characters without special
    meaning in HTML are replaced as well (which might or might not
    be wanted), and code points represented using surrogate pairs
    were not handled correctly.
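For comparison, here is a minimal sketch of an escaper that walks the string by code point rather than by char, so a surrogate pair becomes one numeric character reference. The class and method names are hypothetical, and writing every non-ASCII code point as a decimal reference is just one possible policy, not something the thread prescribes.

```java
// Sketch only: CodePointEscapeSketch and escape() are hypothetical names.
final class CodePointEscapeSketch {

    /** Escapes markup characters and writes each non-ASCII code point as a decimal reference. */
    static String escape(final String text) {
        final StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            // codePointAt reads a complete surrogate pair as a single code point
            final int cp = text.codePointAt(i);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:
                    if (cp < 0x80) {
                        out.append((char) cp);                  // plain ASCII stays as-is
                    } else {
                        out.append("&#").append(cp).append(';'); // assumption: escape all non-ASCII
                    }
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        System.out.println(escape("a"));            // a
        System.out.println(escape("\u00e4"));       // &#228;
        System.out.println(escape("&"));            // &amp;
        System.out.println(escape("\ud800\udc00")); // &#65536;
    }
}
```

The last call uses the same surrogate pair as the test above. It yields a single reference for U+10000 instead of two references to the individual surrogates.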

    The last time I looked, the JDK had some methods for this, but
    most of them were private, protected or not intended for use
    by applications:

    javax.swing.text.html.HTMLWriter#output(char[] chars, int start, int length)
    java.util.logging.XMLFormatter#escape(StringBuffer sb, String text)
    java.beans.XMLEncoder#quoteCharacters(String s)
    com.sun.org.apache.xml.internal.serialize.XMLSerializer#printEscaped(String source)
    com.sun.org.apache.xml.internal.serialize.XML11Serializer#printEscaped(String source)
    com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDAbstractTraverser#processAttValue(String original)

    "package":

    com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOM#processAttValue(String original)

    "public":

    com.sun.org.apache.xalan.internal.client.XSLTProcessorApplet#escapeString(String s)

    So, you might try an approach like in the following class I wrote.

    public final class Text
    {
      /* Returns the character reference for a character with special
         meaning in HTML, or null if the character needs no escaping.
         "&#39;" is used for the apostrophe (an assumption; HTML 4
         does not define "&apos;"). */
      public static java.lang.String sourceCharacter( final char s )
      { return
          ( s < 63 && s >= 34 ) ?
            ( s < 40 ?
              ( s == '"'  ? "&quot;" :
                s == '&'  ? "&amp;"  :
                s == '\'' ? "&#39;"  : null ) :
              s >= 60 ?
              ( s == '<' ? "&lt;" :
                s == '>' ? "&gt;" : null ) : null ) : null; }

      /* Returns the text with all special characters replaced, or the
         original string object if nothing had to be replaced. */
      public static java.lang.String sourceText( final java.lang.String text )
      { java.lang.StringBuilder buffer = null;
        int growth = 0;
        final int length = text.length();
        for( int i = 0; i < length; ++i )
        { final java.lang.String sourceChar =
            sourceCharacter( text.charAt( i ) );
          if( sourceChar != null )
          { if( buffer == null ) buffer = new java.lang.StringBuilder( text );
            final int position = i + growth;
            buffer.replace( position, position + 1, sourceChar );
            /* one char was replaced by sourceChar.length() chars */
            growth += sourceChar.length() - 1; }}
        return buffer == null ? text : buffer.toString(); }

      /* untested */
      public static void main( final String[] args )
      { java.lang.System.out.println
        ( sourceText
          ( "<alpha beta=\"gamma\" delta='epsilon' />" )); }}

    This is only intended to encode characters with special
    meanings. Depending on the encoding used for the HTML
    document, other characters might have to be represented using
    character references, too.
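One way to decide which other characters need a character reference is to ask the target charset whether it can encode them. The following is a rough sketch under that assumption. The class and method names are hypothetical, and it covers only the encoding-dependent step, so it should run after the markup characters have been escaped (otherwise the "&" of the references it emits would be escaped again).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Sketch only: CharsetAwareEscapeSketch and escapeFor() are hypothetical names.
final class CharsetAwareEscapeSketch {

    /** Replaces every code point the charset cannot encode with a decimal character reference. */
    static String escapeFor(final String text, final Charset charset) {
        final CharsetEncoder encoder = charset.newEncoder();
        final StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            final int cp = text.codePointAt(i);
            final String chars = new String(Character.toChars(cp));
            if (encoder.canEncode(chars)) {
                out.append(chars);                       // representable in the document encoding
            } else {
                out.append("&#").append(cp).append(';'); // not representable: use a reference
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        final String sample = "\u00e4\u20ac"; // a-umlaut followed by the euro sign
        System.out.println(escapeFor(sample, StandardCharsets.US_ASCII));
        System.out.println(escapeFor(sample, StandardCharsets.ISO_8859_1));
    }
}
```

With US-ASCII both characters become references. With ISO-8859-1 only the euro sign does, because that charset can encode the a-umlaut directly.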

    Another idea might be to use a
    javax.swing.text.html.HTMLEditorKit to write the text into a
    document and then serialize it to HTML, but I have not tried
    this.
     
    Stefan Ram, Jun 19, 2008
    #2

  3. Mark Space Guest

    wrote:
    > Is there anything in the Java API that will sanitize strings for
    > display in HTML (e.g. replace HTML tokens with escape sequences), or
    > is it normal to roll your own? I'm playing around with Java, and Java
    > servlets; I'm primarily a C++ programmer and not too familiar with
    > Java, so sorry if this is a silly question.


    I think you can get by with just replacing "&" with "&amp;" and "<" with
    "&lt;". This is really an HTML/XML type question, not Java per se. The
    String class has a replaceAll() method you can use.
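A minimal sketch of that two-replacement approach follows. The class and method names are hypothetical. It uses String.replace(CharSequence, CharSequence), which does literal replacement, instead of the regex-based replaceAll(); either works for these two characters. It only covers text placed in element content. A value placed inside a quoted attribute would also need its quote characters escaped.

```java
// Sketch only: MinimalEscapeSketch and escapeText() are hypothetical names.
final class MinimalEscapeSketch {

    /** Escapes "&" and "<" for use in HTML element content. */
    static String escapeText(final String s) {
        // "&" must be replaced first; otherwise the "&" in "&lt;" would be escaped again
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    public static void main(final String[] args) {
        System.out.println(escapeText("a < b && c")); // a &lt; b &amp;&amp; c
    }
}
```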

    If you're talking about data validation in general, that's a whole
    'nother ball of wax. Be careful about assuming that all you need to do
    is "sanitize HTML." I'm not an expert, but it would be wise to become
    one before designing a strategy to validate input, especially input
    taken from a web form.

    This is really too complicated for the Java API. You might try
    libraries designed to interact with users on the web; there may be
    sub-libraries designed with various validators in mind. Struts is
    probably the oldest, and the Struts home page has links to other
    web-oriented UI frameworks that might be useful, under Similar Projects.

    http://struts.apache.org/
     
    Mark Space, Jun 19, 2008
    #3
  4. Roedy Green Guest

    On Thu, 19 Jun 2008 11:20:20 -0700 (PDT), ""
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Is there anything in the Java API that will sanitize strings for
    >display in HTML (e.g. replace HTML tokens with escape sequences), or
    >is it normal to roll your own? I'm playing around with Java, and Java
    >servlets; I'm primarily a C++ programmer and not too familiar with
    >Java, so sorry if this is a silly question.


    I have written some stuff that might prove useful.

    http://mindprod.com/products1.html#ENTITIES

    which interconverts between Unicode and &xxx; entities.
    There is also a method to strip out HTML tags, leaving you just the raw
    text.

    http://mindprod.com/products1.html#AMPER
    which converts & to &amp; where appropriate in a malformed HTML
    document.

    If your HTML is well-formed, you can render it inside Java JLabels and
    JTextAreas. see http://mindprod.com/jgloss/htmlrendering.html

    There is also a utility http://mindprod.com/applet/quoter.html
    that will transform data in many different ways, including converting
    HTML to Java string literals.
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Jun 20, 2008
    #4