Whitespace removal in html generated by cgi

Discussion in 'Perl Misc' started by Gregory Toomey, Nov 16, 2003.

  1. A few weeks ago a question was asked in this group about removing whitespace from html, in particular from html generated by cgi.
    Here's a simple technique I developed for Linux:


    1. A sample cgi. Bash uses the <<'delimiter' construct to pass the input verbatim to Perl. The output of the cgi is piped to delspace.pl, our whitespace munger.

    #!/bin/bash
    /usr/bin/perl <<'EOFPERL' | ./delspace.pl
    #your cgi goes here
    use strict;
    $|++;
    print "Content-type:text/html\n\n";
    print " <h1> This is a test </h1> \n";
    print " some more text\n";

    EOFPERL


    2. Now here's delspace.pl, the whitespace remover. It may be a little buggy, but it seems to work for my simple html.

    #!/usr/bin/perl
    my $count=0;
    while(<>){
    # remove leading whitespace
    s/^\s+//;

    # remove trailing whitespace
    s/\s+$//;

    # change internal whitespace to single space
    s/\s+/ /g;

    # remove simple one line comments
    s/<!--.*?-->//;

    # another simple whitespace removal
    s/> </></g;

    #newlines are not needed
    #except for Content-type-text/html\n\n
    # which occurs at the start
    print;
    print "\n" if $count++<4;
    }



    gtoomey
     
    Gregory Toomey, Nov 16, 2003
    #1

  2. Ben Morrow

    [please limit your line lengths to 72 characters]
    [please make sure your blank lines are *actually* blank]

    Gregory Toomey <> wrote:
    > A few weeks ago a question was asked in this group about removing
    > whitespace from html, in particular from html generated by cgi.
    > Here's a simple technique I developed for Linux:
    >
    > 1. A sample cgi. Bash uses the <<'delimiter' construct to pass the
    > input verbatim to Perl. The output of the cgi is piped to
    > delspace.pl, our whitespace munger.
    >
    > #!/bin/bash


    There is absolutely no need to use bash. If nothing better, use the
    techniques described in perldoc perlipc "Safe Pipe Opens". Better, use
    a tied filehandle or a PerlIO layer on STDOUT. Or simply generate the
    thing without superfluous whitespace in the first place.
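
    A minimal sketch of the in-process route, assuming the existing cgi body can be wrapped in a sub. It uses an in-memory filehandle plus select() as a simpler stand-in for a tied handle or a custom PerlIO layer. The names capture_output and my_cgi are made up for illustration and are not from the original post. Note that this only catches plain print() calls, not print STDOUT or output from child processes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): run $code with the default
# output handle pointed at an in-memory buffer, then return what it printed.
sub capture_output {
    my ($code) = @_;
    my $buffer = '';
    open(my $mem, '>', \$buffer) or die "Cannot open in-memory handle: $!";
    my $previous = select($mem);    # plain print() now goes to $mem
    my $ok = eval { $code->(); 1 };
    my $error = $@;
    select($previous);              # always restore the old default handle
    close($mem);
    die $error unless $ok;
    return $buffer;
}

# Stand-in for an existing cgi body.
sub my_cgi {
    print "Content-type: text/html\n\n";
    print " <h1> This is a test </h1> \n";
    print " some more text\n";
}

my $raw = capture_output(\&my_cgi);

# $raw can now be post-processed in the same process before it is sent.
print $raw;
```

    Whether this is simpler than the bash wrapper is a judgment call; it does avoid the extra shell and the second perl process.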

    <snip>
    > 2. Now here's delspace.pl, the whitespace remover. It may be a
    > little buggy, but it seems to work for my simple html.
    >
    > #!/usr/bin/perl
    > my $count=0;
    > while(<>){
    > # remove trailing whitespace
    > s/^\s+//;
    >
    > # remove leading whitespace
    > s/\s+$//;
    >
    > # change internal whitespace to single space
    > s/\s+/ /g;
    >
    > # remove simple one line comments
    > s/<!--.*?-->//;
    >
    > # another simple whitespace removal
    > s/> </></g;


    You realise this changes the presentation of the HTML?

    > #newlines are not needed
    > #except for Content-type-text/html\n\n
    > # which occurs at the start
    > print;
    > print "\n" if $count++<4;


    Why 4?

    > }


    'A little buggy'? The whole idea's fundamentally flawed: you need to
    start by separating the HTTP from the HTML from the data, which means
    using an HTML parsing module. For instance, what about this:

    <link
    rel=stylesheet
    type="text/css"
    href="..."/>

    Or this:

    Status: 302 Found
    Location: ...
    Content-encoding: ...
    Content-type: text/html
    Content-length: ...

    <html>...

    Or this:

    <pre>
    #!/usr/bin/perl

    use warnings;
    use strict;

    print "Hello world\n";
    </pre>
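
    For the header case, one possible approach (a sketch, assuming the whole response fits in memory) is to split the output at the first blank line and leave everything before it alone. split_cgi_output is a hypothetical name. With this there is no need to guess how many leading lines must keep their newline.

```perl
use strict;
use warnings;

# Hypothetical helper: split raw CGI output at the first blank line.
# Returns (header block, body); body is '' when there is no blank line.
sub split_cgi_output {
    my ($raw) = @_;
    my ($headers, $body) = split /\r?\n\r?\n/, $raw, 2;
    $headers = '' unless defined $headers;
    $body    = '' unless defined $body;
    return ($headers, $body);
}

my $raw = "Status: 302 Found\nContent-type: text/html\n\n<html> ... </html>\n";
my ($headers, $body) = split_cgi_output($raw);

# Only $body is a candidate for whitespace munging. The header lines are
# printed as they came, followed by a fresh blank line as the separator.
print $headers, "\n\n", $body;
```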

    Ben

    --
    I've seen things you people wouldn't believe: attack ships on fire off the
    shoulder of Orion; I've watched C-beams glitter in the darkness near the
    Tannhauser Gate. All these moments will be lost, in time, like tears in rain.
    Time to die. |-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-|
     
    Ben Morrow, Nov 16, 2003
    #2

  3. It was a dark and stormy night, and Ben Morrow managed to scribble:

    > [please limit your line lengths to 72 characters]
    > [please make sure your blank lines are *actually* blank]
    >
    > Gregory Toomey <> wrote:
    >> A few weeks ago a question was asked in this group about removing
    >> whitespace from html, in particular from html generated by cgi.
    >> Here's a simple technique I developed for Linux:
    >>
    >> 1. A sample cgi. Bash uses the <<'delimiter' construct to pass the
    >> input verbatim to Perl. The output of the cgi is piped to
    >> delspace.pl, our whitespace munger.
    >>
    >> #!/bin/bash

    >
    > There is absolutely no need to use bash. If nothing better, use the
    > techniques described in perldoc perlipc "Safe Pipe Opens". Better, use
    > a tied filehandle or a PerlIO layer on STDOUT. Or simply generate the
    > thing without superfluous whitespace in the first place.
    >


    The technique I described lets you take an existing cgi and change two lines at the top and one at the bottom.
    What you described will work, but it's more complicated.



    > <snip>
    >> 2. Now here's delspace.pl, the whitespace remover. It may be a
    >> little buggy, but it seems to work for my simple html.
    >>
    >> #!/usr/bin/perl
    >> my $count=0;
    >> while(<>){
    >> # remove trailing whitespace
    >> s/^\s+//;
    >>
    >> # remove leading whitespace
    >> s/\s+$//;
    >>
    >> # change internal whitespace to single space
    >> s/\s+/ /g;
    >>
    >> # remove simple one line comments
    >> s/<!--.*?-->//;
    >>
    >> # another simple whitespace removal
    >> s/> </></g;

    >
    > You realise this changes the presentation of the HTML?
    >
    >> #newlines are not needed
    >> #except for Content-type-text/html\n\n
    >> # which occurs at the start
    >> print;
    >> print "\n" if $count++<4;

    >
    > Why 4?
    >
    >> }

    >
    > 'A little buggy'? The whole idea's fundamentally flawed: you need to
    > start by separating the HTTP from the HTML from the data, which means
    > using an HTML parsing module. For instance, what about this:
    >


    It worked with all the cgis I've created.
    It's just a simple, pragmatic way to solve a real-world problem.


    gtoomey
     
    Gregory Toomey, Nov 16, 2003
    #3
  4. -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Gregory Toomey <> wrote in
    news:1933712.m1tGeoNVPB@gregs-web-hosting-and-pickle-farming:

    > A few weeks ago a question was asked in this group about removing
    > whitespace from html, in particular from html generated by cgi. Here's
    > a simple technique I developed for Linux:


    What is the goal of this? Reducing the amount of data that is
    transmitted to the client browser? If so, you would probably be better
    off compressing the output with gzip -- all major browsers support gzip
    compressed data.

    [...]
    > #newlines are not needed
    > #except for Content-type-text/html\n\n
    > # which occurs at the start
    > print;
    > print "\n" if $count++<4;


    Newlines are needed in <pre>...</pre> sections, and sometimes in
    <textarea>...</textarea> sections.
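
    A rough sketch of how the munger could leave those sections alone, assuming the body is held in one string. squeeze_html is a hypothetical name, and the regex is only an approximation: it does not cope with nested or malformed markup the way a real HTML parser would.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of whitespace to a single space,
# except inside <pre> and <textarea> elements, which are copied verbatim.
sub squeeze_html {
    my ($html) = @_;

    # The capturing group makes split() return the protected elements too,
    # so the pieces alternate: ordinary markup, protected element, ...
    my @pieces = split
        m{(<(?:pre|textarea)\b[^>]*>.*?</(?:pre|textarea)\s*>)}is,
        $html, -1;

    my $out = '';
    for my $i (0 .. $#pieces) {
        my $piece = $pieces[$i];
        $piece =~ s/\s+/ /g if $i % 2 == 0;   # even index: unprotected text
        $out .= $piece;
    }
    return $out;
}

print squeeze_html("<p>a   b</p>\n<pre>x\n  y</pre>\n<p>c</p>"), "\n";
```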

    - --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

    -----BEGIN PGP SIGNATURE-----
    Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

    iQA/AwUBP7f0GWPeouIeTNHoEQKoQACg4qJhX/JKb6y7ZCOK9eiMVqXih9EAn2px
    YT5a72WavpE6GErYnLOzUQ+d
    =zRRz
    -----END PGP SIGNATURE-----
     
    Eric J. Roode, Nov 16, 2003
    #4
  5. On Sun, 16 Nov 2003, Eric J. Roode wrote:

    >> #newlines are not needed
    >> #except for Content-type-text/html\n\n
    >> # which occurs at the start
    >> print;
    >> print "\n" if $count++<4;

    >
    >Newlines are needed in <pre>...</pre> sections, and sometimes in
    ><textarea>...</textarea> sections.


    Not to mention that, although most HTML renders multiple whitespace as a
    SINGLE space, a SINGLE newline IS needed, because the browser will render
    it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
    string like "foo \n bar" is also just rendered as "foo bar".
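
    A small illustration of the difference, as a sketch only (join_stripped is a made-up helper): stripping each line and printing it with nothing in between glues words together, while keeping one separator character preserves the word break.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace from each
# line, then join the lines with the given separator.
sub join_stripped {
    my ($separator, @lines) = @_;
    my @stripped = map { my $line = $_; $line =~ s/^\s+|\s+$//g; $line } @lines;
    return join $separator, @stripped;
}

print join_stripped('',  "foo\n",  "bar\n"),  "\n";   # foobar
print join_stripped(' ', "foo \n", " bar\n"), "\n";   # foo bar
```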

    --
    Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
    "And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
    years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
    Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)
     
    Jeff 'japhy' Pinyan, Nov 16, 2003
    #5
  6. It was a dark and stormy night, and Eric J. Roode managed to scribble:

    > -----BEGIN PGP SIGNED MESSAGE-----
    > Hash: SHA1
    >
    > Gregory Toomey <> wrote in
    > news:1933712.m1tGeoNVPB@gregs-web-hosting-and-pickle-farming:
    >
    >> A few weeks ago a question was asked in this group about removing
    >> whitespace from html, in particular from html generated by cgi. Here's
    >> a simple technique I developed for Linux:

    >
    > What is the goal of this? Reducing the amount of data that is
    > transmitted to the client browser?

    Yes.
    >If so, you would probably be better
    > off compressing the output with gzip -- all major browsers support gzip
    > compressed data.


    Yes I use Apache with gzip so that's another level of compression.

    People hate waiting for pages to load, especially those on dialup.

    >
    > [...]
    >> #newlines are not needed
    >> #except for Content-type-text/html\n\n
    >> # which occurs at the start
    >> print;
    >> print "\n" if $count++<4;

    >
    > Newlines are needed in <pre>...</pre> sections, and sometimes in
    > <textarea>...</textarea> sections.
    >
    > - --
    > Eric
    > $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    > ts p , map $ _. $ " , qw e p h tona e and print
    >
    > -----BEGIN PGP SIGNATURE-----
    > Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>
    >
    > iQA/AwUBP7f0GWPeouIeTNHoEQKoQACg4qJhX/JKb6y7ZCOK9eiMVqXih9EAn2px
    > YT5a72WavpE6GErYnLOzUQ+d
    > =zRRz
    > -----END PGP SIGNATURE-----
     
    Gregory Toomey, Nov 16, 2003
    #6
  7. -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Jeff 'japhy' Pinyan <> wrote in
    news:p:

    > Not to mention that, although most HTML renders multiple whitespace as a
    > SINGLE space, a SINGLE newline IS needed, because the browser will render
    > it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
    > string like "foo \n bar" is also just rendered as "foo bar".


    Ooh, good point.

    - --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

    -----BEGIN PGP SIGNATURE-----
    Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

    iQA/AwUBP7gZY2PeouIeTNHoEQJuPwCePA4BQ8lKxNoFVeJK7PeCK7vOgaUAn1xC
    xlc/HAuS24OiXl9X1RTYqVPZ
    =iONd
    -----END PGP SIGNATURE-----
     
    Eric J. Roode, Nov 17, 2003
    #7
  8. -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Gregory Toomey <> wrote in news:3072218.31r3eYUQgx@gregs-
    web-hosting-and-pickle-farming:

    >
    > People hate waiting for pages to load, especially for people on dialup.


    Have you verified that the extra time your CGI scripts take to execute is
    less than the transfer time of the spaces you are eliminating? :)

    - --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

    -----BEGIN PGP SIGNATURE-----
    Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

    iQA/AwUBP7gZyWPeouIeTNHoEQJc6QCfRsU9IVVvuPbf1LCJ65Ot7K+TVJUAnRXm
    MizOFx2ThfFeAocFzgE/LLZ/
    =fWE0
    -----END PGP SIGNATURE-----
     
    Eric J. Roode, Nov 17, 2003
    #8
  9. It was a dark and stormy night, and Eric J. Roode managed to scribble:

    > -----BEGIN PGP SIGNED MESSAGE-----
    > Hash: SHA1
    >
    > Gregory Toomey <> wrote in
    > news:3072218.31r3eYUQgx@gregs- web-hosting-and-pickle-farming:
    >
    >>
    >> People hate waiting for pages to load, especially for people on dialup.

    >
    > Have you verified that the extra time your CGI scripts take to execute is
    > less than the transfer time of the spaces you are eliminating? :)
    >


    The server I use for cgi is about 2.6GHz and averages 20% CPU utilisation.
    Running the script to remove whitespace takes under 1 second for 1000 lines of HTML, and does not increase the load to any discernible extent.

    The database-driven cgi I use is disk IO bound, not CPU bound.

    gtoomey
     
    Gregory Toomey, Nov 17, 2003
    #9
  10. It was a dark and stormy night, and Eric J. Roode managed to scribble:

    > -----BEGIN PGP SIGNED MESSAGE-----
    > Hash: SHA1
    >
    > Jeff 'japhy' Pinyan <> wrote in
    > news:p:
    >
    >> Not to mention that, although most HTML renders multiple whitespace as a
    >> SINGLE space, a SINGLE newline IS needed, because the browser will render
    >> it as a space. That is, "foo\nbar" is rendered as "foo bar", while a
    >> string like "foo \n bar" is also just rendered as "foo bar".

    >
    > Ooh, good point.
    >



    I tried it on a dozen cgis and it worked.

    To make this foolproof you need to write an HTML parser - this is left as an exercise for the reader!

    gtoomey
     
    Gregory Toomey, Nov 17, 2003
    #10
  11. Gregory Toomey wrote:
    > It was a dark and stormy night, and Eric J. Roode managed to scribble:
    >
    >
    >>-----BEGIN PGP SIGNED MESSAGE-----
    >>Hash: SHA1
    >>
    >>Gregory Toomey <> wrote in
    >>news:3072218.31r3eYUQgx@gregs- web-hosting-and-pickle-farming:
    >>
    >>
    >>>People hate waiting for pages to load, especially for people on dialup.

    >>
    >>Have you verified that the extra time your CGI scripts take to execute is
    >>less than the transfer time of the spaces you are eliminating? :)
    >>

    >
    >
    > The server I use for cgi is about 2.6GHz and averages 20% CPU utilisation.
    > Running the script to remove whitespace takes under 1 second for 1000 lines of HTML,
    > and does not increase the load to any discernible extent.
    >
    > The database-driven cgi I use is disk IO bound, not CPU bound.
    >

    Which doesn't answer the question. The question isn't "Are you overloading the
    server?", the question is "Are your users waiting longer for you to remove the
    whitespace than they would wait for the whitespace to download?" Assuming there
    is ten bytes of removable whitespace per line (which would be rather a lot),
    then the whitespace in 1000 lines takes less than two seconds to download on
    a 56K modem. It would take a small fraction of a second with broadband. It
    scarcely seems worth the effort.
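
    Spelling out that arithmetic as a quick sketch. It ignores latency, protocol overhead and modem compression, and the 1 Mbit/s broadband figure is an assumption for illustration, not a number from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: seconds needed to send $bytes over a link that
# carries $bits_per_second.
sub transfer_seconds {
    my ($bytes, $bits_per_second) = @_;
    return $bytes * 8 / $bits_per_second;
}

my $whitespace_bytes = 10 * 1000;   # 10 bytes per line, 1000 lines

printf "56K modem: %.2f s\n", transfer_seconds($whitespace_bytes, 56_000);     # 1.43 s
printf "1 Mbit/s:  %.2f s\n", transfer_seconds($whitespace_bytes, 1_000_000);  # 0.08 s
```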

    Chris Mattern
     
    Chris Mattern, Nov 17, 2003
    #11
  12. Gregory Toomey <> wrote:

    : It was a dark and stormy night, and Eric J. Roode managed to scribble:
    :> What is the goal of this? Reducing the amount of data that is
    :> transmitted to the client browser?

    : Yes.

    :>If so, you would probably be better
    :> off compressing the output with gzip -- all major browsers support gzip
    :> compressed data.

    : Yes I use Apache with gzip so that's another level of compression.

    If you're gzipping the output stream, then the removal of spaces isn't likely
    to change your transmission size significantly, if at all. The compressor
    will flatten them right out, without risking the content of the HTML.

    Also note that if you have a CGI that sends back something besides HTML,
    such as image or sound data, this will completely screw it up.
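
    A possible guard against that, sketched under the assumption that the header block has already been separated from the body: only munge when the headers declare text/html. is_html_response is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the header block declares an HTML body.
sub is_html_response {
    my ($headers) = @_;
    return $headers =~ m{^Content-type:\s*text/html\b}mi ? 1 : 0;
}

for my $headers ("Content-type: text/html; charset=utf-8",
                 "Content-Type: image/png") {
    printf "%-40s %s\n", $headers,
        is_html_response($headers) ? 'munge' : 'pass through';
}
```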

    --
    Louis Erickson - - http://www.rdwarf.com/~wwonko/

    Andrea: Unhappy the land that has no heroes.
    Galileo: No, unhappy the land that needs heroes.
    -- Bertolt Brecht, "Life of Galileo"
     
    Louis Erickson, Nov 17, 2003
    #12
