Servlet or filter or something to remove whitespace?

Discussion in 'Java' started by Ryan Stewart, Jun 9, 2004.

  1. Ryan Stewart

    Ryan Stewart Guest

    I didn't really mean to make two posts so close together, but I just thought
    of something I was considering at work today. I'm building a web application
    which has some potentially very large generated pages. One JSP in particular
    I've been working on can generate HTML over 1M in size. Due to the taglibs
    and indentation and so on, a *lot* of this is just whitespace. Does anyone
    know of a servlet or filter (or Struts plug-in, as that's what we're using)
    that would remove or compress the extra whitespace? I plan on trying to
    write something, but we're on a tight deadline atm, so it'll have to wait a
    while.
    Ryan Stewart, Jun 9, 2004
    #1

  2. Ryan Stewart

    Real Gagnon Guest

    > is just whitespace. Does anyone know of a servlet or filter (or Struts
    > plug-in, as that's what we're using) that would remove or compress the
    > extra whitespace? I plan on trying to write something, but we're on a
    > tight deadline atm, so it'll have to wait a while.


    See Trim Filter ver. 1.5, at
    http://www.servletsuite.com/servlets/trimflt.htm

    you may also consider "the Compress Filter ver. 1.3"
    http://www.servletsuite.com/servlets/gzipflt.htm

    Bye.
    --
    Real Gagnon from Quebec, Canada
    * Looking for Java or PB snippets ? Visit Real's How-to
    * http://www.rgagnon.com/howto.html
    Real Gagnon, Jun 9, 2004
    #2

  3. On Tue, 8 Jun 2004 19:24:15 -0500, Ryan Stewart wrote:

    > One JSP in particular
    > I've been working on can generate HTML over 1M in size. Due to the taglibs
    > and indentation and so on, a *lot* of this is just whitespace.


    Could this be a job for..
    res.setHeader("Content-Encoding", "gzip")
    res.setHeader("Content-Encoding", "x-compress")
    res.setHeader("Content-Encoding", "compress")
    ?

    [ OK, that is from my servlet
    book, but be lateral.. ]

    --
    Andrew Thompson
    http://www.PhySci.org/ Open-source software suite
    http://www.PhySci.org/codes/ Web & IT Help
    http://www.1point1C.org/ Science & Technology
    Andrew Thompson, Jun 9, 2004
    #3
  4. Ryan Stewart

    Roedy Green Guest

    On Wed, 09 Jun 2004 01:09:03 GMT, Andrew Thompson
    <> wrote or quoted :

    >> I've been working on can generate HTML over 1M in size. Due to the taglibs
    >> and indentation and so on, a *lot* of this is just whitespace.


    you might want to use Compactor logic on it to remove excess
    whitespace.

    Look at the html on my website to see the result.

    That works even if the browser does not support gzip.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 9, 2004
    #4
  5. Ryan Stewart

    Liz Guest

    "Ryan Stewart" <> wrote in message
    news:...
    > I didn't really mean to make two posts so close together, but I just
    > thought of something I was considering at work today. I'm building a web
    > application which has some potentially very large generated pages. One JSP
    > in particular I've been working on can generate HTML over 1M in size. Due
    > to the taglibs and indentation and so on, a *lot* of this is just
    > whitespace. Does anyone know of a servlet or filter (or Struts plug-in, as
    > that's what we're using) that would remove or compress the extra
    > whitespace? I plan on trying to write something, but we're on a tight
    > deadline atm, so it'll have to wait a while.
    >

    Since you are on a tight deadline it would be wasting your time to make this
    improvement now. Most links can handle the size easily, and most of the
    slower ones, like 56 kbit/s dial-up, have compression built into the modem
    or PPP protocol. So keep your eyes on the main job for now.
    Liz, Jun 9, 2004
    #5
  6. Ryan Stewart

    Ryan Stewart Guest

    "Roedy Green" <> wrote in message
    news:...
    > On Wed, 09 Jun 2004 01:09:03 GMT, Andrew Thompson
    > <> wrote or quoted :
    >
    > >> I've been working on can generate HTML over 1M in size. Due to the
    > >> taglibs and indentation and so on, a *lot* of this is just whitespace.

    >
    > you might want to use Compactor logic on it to remove excess
    > whitespace.
    >
    > Look at the html on my website to see the result.
    >
    > That works even if the browser does not support gzip.
    >

    Yes, something like that is what I would like. It'd be nice if you could get
    rid of the cr/lf's as well. What is this "Compactor logic"?
    Ryan Stewart, Jun 9, 2004
    #6
  7. Ryan Stewart

    Roedy Green Guest

    On Tue, 8 Jun 2004 22:22:06 -0500, "Ryan Stewart"
    <> wrote or quoted :

    >Yes, something like that is what I would like. It'd be nice if you could get
    >rid of the cr/lf's as well. What is this "Compactor logic"?


    See http://mindprod.com/projects/htmlcompactor.html
    It collapses CrLf to Lf and gets rid of lead/trail space on each line.
    You can take out more than that without changing how it renders.

    I have implemented the part of the project that does not require a
    browser plugin.

    I have not released the code publicly yet. It needs some
    documentation and a little polishing. It works fine. I have been
    using it in production over a year without incident.

    I'd be willing to give it to you on an as-is basis, which would
    probably prompt me to prepare it for its public debut.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 9, 2004
    #7
  8. Ryan Stewart

    Liz Guest

    >
    > I have not released the code publicly yet. It needs some
    > documentation and a little polishing. It works fine. I have been
    > using it in production over a year without incident.
    >

    How can you have something in production that is not documented?????

    BTW: is this bottom post format ok?
    Liz, Jun 9, 2004
    #8
  9. Ryan Stewart

    Roedy Green Guest

    On Wed, 09 Jun 2004 05:57:05 GMT, "Liz" <> wrote or
    quoted :

    >How can you have something in production that is not documented?????


    To be published, I need to package it with end-user level
    documentation. For my own use the JavaDoc suffices plus the project
    description.

    It is also a matter of bundling up zips with all the relevant source
    code.

    There is also the matter of separating out code that is not of general
    interest.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 9, 2004
    #9
  10. Ryan Stewart

    Liz Guest

    "Roedy Green" <> wrote in message
    news:p...
    > On Wed, 09 Jun 2004 05:57:05 GMT, "Liz" <> wrote or
    > quoted :
    >
    > >How can you have something in production that is not documented?????

    >
    > To be published, I need to package it with end-user level
    > documentation. For my own use the JavaDoc suffices plus the project
    > description.
    >
    > It is also a matter of bundling up zips with all the relevant source
    > code.
    >
    > There is also the matter of separating out code that is not of general
    > interest.
    >
    > --

    So it is not in production, just developer test.
    Liz, Jun 9, 2004
    #10
  11. Andrew Thompson wrote:

    > On Tue, 8 Jun 2004 19:24:15 -0500, Ryan Stewart wrote:
    >
    >
    >> One JSP in particular
    >>I've been working on can generate HTML over 1M in size. Due to the taglibs
    >>and indentation and so on, a *lot* of this is just whitespace.

    >
    >
    > Could this be a job for..
    > res.setHeader("Content-Encoding", "gzip")
    > res.setHeader("Content-Encoding", "x-compress")
    > res.setHeader("Content-Encoding", "compress")
    > ?


    Only if you actually perform the specified compression on the content,
    so that doesn't help Ryan by itself. It must also be acceptable that
    some clients may not be able to decompress the content, as support for
    these encodings (indeed, for almost all content encodings) is optional
    for HTTP 1.1 user agents.
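
To make the point concrete, here is a minimal sketch of the two steps that have to go together: checking what the client accepts, and actually compressing the bytes. The class and method names (GzipSketch, acceptsGzip, gzip) are made up for illustration, and the header check ignores q-values, so treat it as an outline under those assumptions rather than a finished filter. In a real servlet filter the header value would come from HttpServletRequest.getHeader("Accept-Encoding"), and the Content-Encoding header would only be set when the compressed bytes are what gets written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

final class GzipSketch {

    /**
     * True if the Accept-Encoding header value lists gzip.
     * Simplification for this sketch: q-values such as "gzip;q=0" are ignored.
     */
    static boolean acceptsGzip(String acceptEncoding) {
        if (acceptEncoding == null) {
            return false;
        }
        for (String token : acceptEncoding.split(",")) {
            int semi = token.indexOf(';');
            String name = (semi >= 0 ? token.substring(0, semi) : token).trim();
            if (name.equalsIgnoreCase("gzip")) {
                return true;
            }
        }
        return false;
    }

    /** Compresses the response body in memory and returns the gzip bytes. */
    static byte[] gzip(byte[] body) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(body);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }
}
```

Whitespace-heavy HTML is highly repetitive, so it tends to compress well, but clients that do not send a matching Accept-Encoding must still receive the uncompressed body.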

    It ought to be pretty easy to write a custom tag to wrap around the
    documents that would remove leading and trailing space from each line.
    Beyond that, though, you have to start worrying where you can remove
    whitespace without changing document semantics. (Even with just
    leading-space removal you have to worry about mangling preformatted
    text, so if you need to worry about that then you need a smarter tag.)
    There are surely several other solutions.
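
As a rough sketch of the line-trimming idea, the following strips leading and trailing space from each line but leaves alone any line that starts inside, opens, or closes a <pre> block. LineTrimmer and trimLines are hypothetical names, the <pre> detection is a plain substring match, and other whitespace-sensitive elements (textarea, script) are not handled, so this is an illustration of the approach and not a complete tag.

```java
import java.util.Locale;

final class LineTrimmer {

    /** Trims each line of the input, skipping lines that touch a pre block. */
    static String trimLines(String html) {
        // Split on CRLF, bare CR, or LF; -1 keeps trailing empty lines.
        String[] lines = html.split("\r\n|\r|\n", -1);
        StringBuilder out = new StringBuilder(html.length());
        boolean inPre = false;
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            String line = lines[i];
            String lower = line.toLowerCase(Locale.ROOT);
            int open = lower.lastIndexOf("<pre");
            int close = lower.lastIndexOf("</pre");
            // Conservative: any line involved with <pre> is copied verbatim.
            boolean touchesPre = inPre || open >= 0 || close >= 0;
            if (open >= 0 || close >= 0) {
                // Whichever marker comes last on the line decides the new state.
                inPre = open > close;
            }
            out.append(touchesPre ? line : line.trim());
        }
        return out.toString();
    }
}
```

Line endings are normalised to a single \n as a side effect, which also removes the CR from CR/LF pairs.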


    John Bollinger
    John C. Bollinger, Jun 9, 2004
    #11
  12. Ryan Stewart

    Roedy Green Guest

    On Wed, 09 Jun 2004 17:54:25 -0500, "John C. Bollinger"
    <> wrote or quoted :

    >Beyond that, though, you have to start worrying where you can remove
    >whitespace without changing document semantics.


    It is very simple. Basically any run of whitespace can be collapsed
    to a single space, except inside <pre> and inside quoted "..." strings.

    See http://mindprod.com/projects/htmlcompactor.html for details.

    Compaction is very fast, done with a simple state machine.


    here's the core of it.

    package com.mindprod.compactor;

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Iterator;

    import com.mindprod.filter.AllDirectoriesFilter;
    import com.mindprod.filter.ClamFilter;
    import com.mindprod.filter.CommandLine;
    import com.mindprod.hunkio.HunkIO;

    /**
    * Compacts HTML, possibly java source or other text
    *
    * @author Roedy Green
    * @version 1.0
    */
    public class Compactor
    {

    /**
    * Constructor
    */
    public Compactor ( )
    {
    }

    /**
    * true collapse multiple spaces to one.
    */
    public boolean compactWhitespace = true;

    /**
    * collapse whitespace even inside comments.
    * false not implemented.
    */
    public static final boolean compactWhitespaceInComments = true;

    /**
    * convert html tags to lower case.
    * true not implemented.
    */
    public static final boolean lowerCaseTags = false;

    /**
    * maximum allowable blank lines. usually 0 or 1.
    */
    public int maxAllowableBlankLines = 0;

    /**
    * true use Unix \n line terminator,
    * false use Windows \r\n or other platform-specific terminator
    */
    public boolean oneCharLf = true;

    /**
    * true strip out all comments except SSI and macros,
    * true not implemented
    */
    public static final boolean removeComments = false;

    /**
    * true remove leading space from lines.
    */
    public boolean removeLeadSpaces = true;

    /**
    * true remove trailing space from lines.
    * false will never be implemented.
    */
    public static final boolean removeTrailingSpaces = true;

    /**
    * true removes what htmlmacros generate.
    * Would normally be false if you are going
    * to send this to the web.
    * true not implemented.
    */
    public static final boolean removeMacroGenerations = false;

    /**
    * true remove macros.
    * This hides how you generated your HTML from
    * the outside world.
    * The catch is, you can never regenerate your macros again.
    * If true, then removeMacroGenerations should be
    * false.
    * true not implemented.
    */
    public static final boolean removeMacros = false;

    /**
    * Remove unnecessary space on either side of tags.
    * It depends on the tag and the amount
    * of space on the other side of the tag
    * whether space can be completely removed.
    * true not implemented.
    */
    public static final boolean removeSpaceAroundTags = false;

    /**
    * Consolidate tags. e.g <span class="x">this
    * </span><span class="x">and that</span>
    * can be collapsed to <span
    * class="x">this and that</span>.
    * true not implemented.
    */
    public static final boolean consolidateTags = false;

    /**
    * true convert to CBF, compact binary format.
    * The catch here is web browsers can't read this
    * without a plugin. This is the main compaction.
    * true not implemented.
    */
    public static final boolean tokenise = false;

    /**
    * true LZW compression.
    * the catch is, browsers can't read this without
    * a special plugin.
    * true not implemented
    */
    public static final boolean zip = false;

    /**
    * StringBuffer to accumulate the result file
    * character by character.
    */
    protected StringBuffer sb;

    /**
    * how many spaces we have outstanding we have not
    * yet put in the StringBuffer.
    */
    protected int pendingSpaces;

    /**
    * HowMany newLines we have outstanding we have not
    * put in the StringBuffer.
    */
    protected int pendingNewLines;

    /**
    * true if inside <pre>...</pre> where spaces preserved
    */
    protected boolean inPre;

    /**
    * put the pending newlines and spaces into
    * the StringBuffer.
    */
    protected void emitPending()
    {
    // adjust pending newlines, but leave <pre> completely alone.
    if ( pendingNewLines > maxAllowableBlankLines+1 && !inPre )
    {
    pendingNewLines = maxAllowableBlankLines+1;
    }

    // adjust pending spaces
    if ( pendingNewLines > 0 )
    {
    if ( removeLeadSpaces && !inPre )
    {
    pendingSpaces = 0;
    }
    }
    else if ( compactWhitespace && pendingSpaces > 0 && !inPre )
    {
    pendingSpaces = 1;
    }

    // emit pending newLines
    for ( int i=0; i<pendingNewLines; i++ )
    {

    if ( oneCharLf )
    {
    sb.append ( '\n' );
    }
    else
    {
    sb.append ( lineSeparator );
    }
    }
    pendingNewLines = 0;

    // emit pending spaces
    for ( int i=0; i<pendingSpaces; i++ )
    {
    sb.append( ' ' );
    }

    pendingSpaces = 0;

    }
    /**
    * platform specific line separator char
    */
    private static String lineSeparator = System.getProperty (
    "line.separator" );
    /**
    * compact and tidy one file.
    *
    * @param fileBeingProcessed
    * File to compact and tidy.
    * @param quiet true if want progress messages suppressed
    *
    * @exception IOException
    */
    public void compactFile( File fileBeingProcessed, boolean quiet )
    throws IOException
    {

    if ( ! quiet )
    {
    System.out.print(" compacting " +
    fileBeingProcessed.getName()+ " " );
    }
    if ( ! (fileBeingProcessed.getName().endsWith(".html") ||
    fileBeingProcessed.getName().endsWith(".htm")) )
    {
    System.out.println( "Cannot compact: " +
    fileBeingProcessed.getName() + " is not an .html or .htm file" );
    return;
    }
    String big = HunkIO.readEntireFile( fileBeingProcessed );

    String result = compactString( big );
    if ( result.equals( big ) )
    {
    // nothing changed. No need to write results.

    if ( ! quiet )
    {
    System.out.println( "-" );
    }
    return;
    }
    // generate output into a temporary file until we are sure all is ok.
    // create a temp file in the same directory as filename
    if ( ! quiet )
    {
    System.out.println( "*" );
    }
    File tempfile = HunkIO.createTempFile ("temp", ".tmp",
    fileBeingProcessed );
    FileWriter emit = new FileWriter( tempfile );
    emit.write( result );
    emit.close();
    // successfully created output in same directory as input,
    // Now make it replace the input file.
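    // delete() and renameTo() report failure by returning false, so check them.
    if ( !fileBeingProcessed.delete() || !tempfile.renameTo( fileBeingProcessed ) )
    {
    throw new IOException( "Could not replace " + fileBeingProcessed.getName() );
    }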
    } // end compactFile

    /**
    * compact the string by removing whitespace.
    *
    * @param big Fluffy string you want compacted.
    * @return compacted string.
    */
    public String compactString ( String big )
    {
    int originalLength = big.length();

    sb = new StringBuffer( originalLength );

    pendingSpaces = 0;

    pendingNewLines = 0;

    inPre = false;

    // loop through each char categorising it
    // deal with
    // removing lead spaces.
    // removing trail spaces.
    // collapsing excess whitespace
    for ( int i=0; i<originalLength; i++ )
    {
    char c = big.charAt( i );
    switch ( c )
    {

    default:

    // deal with pending newlines and spaces first
    if ( pendingSpaces > 0 || pendingNewLines > 0 )
    {
    emitPending();
    }
    // now emit the normal character.
    sb.append( c );
    break;

    case '\r':
    /* should we ignore this \r ? */
    /* We do if it was immediately followed by a \n */
    if ( !( i+1 < originalLength && big.charAt( i+1 ) == '\n' ) )
    {
    // it was a standalone Mac style \r, treat like \n
    // always ignore trailing spaces, even when inPre = true.
    pendingSpaces = 0;
    pendingNewLines++;
    }
    // otherwise ignore it
    break;

    case '\n':
    // always ignore trailing spaces, even when inPre = true.
    pendingSpaces = 0;
    pendingNewLines++;
    break;

    case '<':
    // deal with pending newlines and spaces first
    if ( pendingSpaces > 0 || pendingNewLines > 0 )
    {
    emitPending();
    }
    if ( inPre )
    {
    if ( ( i + 6 <= originalLength ) && big.substring(
    i, i+6 ).equalsIgnoreCase("</pre>") )
    {
    inPre = false;
    }
    }
    else
    {
    if ( ( i + 5 <= originalLength ) && big.substring(
    i, i+5 ).equalsIgnoreCase("<pre>") )
    {
    inPre = true;
    }
    }
    // now emit the normal character.
    sb.append( '<' );
    break;

    case 0:
    case 1:
    case 2:
    case 3:
    case 4:
    case 5:
    case 6:
    case 7:
    case 8:
    case '\t': // 9 tab
    // case 10: lf
    case 11:
    case 12:
    // case 13: cr
    case 14:
    case 15:
    case 16:
    case 17:
    case 18:
    case 19:
    case 20:
    case 21:
    case 22:
    case 23:
    case 24:
    case 25:
    case 26:
    case 27:
    case 28:
    case 29:
    case 30:
    case 31:
    case ' ': // 32: space
    pendingSpaces++;
    } // end switch

    } // end for
    // allow file to end without a final newline, but no more than one.
    pendingSpaces = 0;
    if ( pendingNewLines > 1 )
    {
    pendingNewLines = 1;
    }
    emitPending();

    return sb.toString();

    } // end compactString

    /**
    * compacts HTML files.
    *
    * @param args names of files to process, dirs, files, -s, *.*,
    * no wildcards.
    */
    public static void main ( String[] args )
    {
    // gather all the files mentioned on the command line.
    // either directories, files, *.*, with -s and subdirs option.

    System.out.println( "Gathering files to process..." );
    Iterator wantedFiles =
    CommandLine.getFilesToProcess(
    args, /* what is on the command line */
    1000, /* estimate of expected files */
    new AllDirectoriesFilter(),
    new ClamFilter( "", ".html" )
    );
    Compactor c = new Compactor();
    for ( Iterator iter=wantedFiles; iter.hasNext(); )
    {
    File file = (File)iter.next();
    try
    {
    c.compactFile( file , false /* not quiet */ );
    }
    catch ( FileNotFoundException e )
    {
    System.out.println( "Error: " + file.getAbsolutePath() +
    " not found." );
    }
    catch ( Exception e )
    {
    System.out.println( e.getMessage() + " in file " +
    file.getAbsolutePath() );
    System.out.println();
    e.printStackTrace();
    }
    } // end for
    } // end main

    } // end Compactor
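
The listing above tracks <pre> but does not appear to treat quoted strings specially, although the rule stated at the top of this post exempts them. As a hedged sketch of how a state machine could cover both cases, the following collapses every run of whitespace to one space except inside <pre> and inside quoted attribute values within a tag. WhitespaceCollapser is an illustrative name, not part of the Compactor; unlike the code above it turns newlines into spaces too, and it does not handle comments, script, or textarea.

```java
final class WhitespaceCollapser {

    /** Collapses whitespace runs outside pre blocks and quoted attribute values. */
    static String collapse(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inPre = false;        // between <pre ...> and </pre>
        boolean inTag = false;        // between '<' and '>'
        char quote = 0;               // open quote character inside a tag, or 0
        boolean pendingSpace = false; // a whitespace run not yet emitted
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                // Inside a quoted attribute value: copy verbatim.
                out.append(c);
                if (c == quote) {
                    quote = 0;
                }
                continue;
            }
            if (inPre && !html.regionMatches(true, i, "</pre", 0, 5)) {
                // Inside preformatted text: copy verbatim.
                out.append(c);
                continue;
            }
            if (Character.isWhitespace(c)) {
                pendingSpace = true;
                continue;
            }
            if (pendingSpace) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                pendingSpace = false;
            }
            if (c == '<') {
                inTag = true;
                inPre = isPreOpen(html, i);
            } else if (c == '>') {
                inTag = false;
            } else if (inTag && (c == '"' || c == '\'')) {
                quote = c;
            }
            out.append(c);
        }
        return out.toString();
    }

    /** True if a pre start tag begins at position i. */
    private static boolean isPreOpen(String html, int i) {
        if (!html.regionMatches(true, i, "<pre", 0, 4) || i + 4 >= html.length()) {
            return false;
        }
        char next = html.charAt(i + 4);
        return next == '>' || Character.isWhitespace(next);
    }
}
```

Quotes are only tracked inside tags, so an apostrophe in ordinary text does not switch the collapser off.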


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 10, 2004
    #12
