html compression tools (command line)

Errol Smith

Hi,

Does anyone know of command line tools for html compression?
The only one I am aware of is htmlcrunch
(http://www.markusstengel.de/htmlcr.html) but, frankly, this does not
perform very well (often makes the input file bigger!).
This is for my website compression tool 'webpack'
(http://www.kludgesoft.com/nix/webpack.html - blatant plug :), for
which I am trying to avoid writing a better html compressor myself.
Rather not re-invent the wheel, you know.
Also if there is any information around on making html more
compressible, I would appreciate pointers to information/tools (the
only method I've heard of is making all html tags lower case, but
there may be other methods).
Any assistance appreciated!

Errol Smith
errol <at> ros (dot) com [period] au
 
rf

Errol Smith said:
Hi,

Does anyone know of command line tools for html compression?

This has been discussed here before. The general consensus is that it is a
waste of time. Look to other things first. Image compression: a badly
compressed image will waste far more bandwidth than compressing the HTML
will save. Number of images: 10 images on a page result in 10 round trips
back to the server, an elapsed time of hundreds of milliseconds, perhaps even
a number of seconds. Compressing the HTML might save ten or so milliseconds.
This is for my website compression tool 'webpack'
(http://www.kludgesoft.com/nix/webpack.html - blatant plug :),

I note that you don't compress *this* page :) You even have great sequences
of cr/lf in there.
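
The round-trip argument in this post can be put into rough numbers. The sketch below is only a back-of-envelope estimate: the 50 ms round-trip time, the 1 Mbit/s link and the 1500 bytes saved are assumed figures, not measurements from this thread, and it pretends requests are fetched strictly one after another (real browsers fetch several in parallel).

```python
def round_trip_cost_ms(requests: int, rtt_ms: float) -> float:
    """Time spent waiting on round trips if requests are made one at a time."""
    return requests * rtt_ms


def transfer_saving_ms(bytes_saved: int, bandwidth_bits_per_s: float) -> float:
    """Transfer time saved by sending bytes_saved fewer bytes over the link."""
    return bytes_saved * 8 / bandwidth_bits_per_s * 1000


# Assumed figures, for illustration only.
images = 10
rtt_ms = 50
bytes_saved_by_compacting_html = 1500
bandwidth = 1_000_000  # bits per second

print("round trips:", round_trip_cost_ms(images, rtt_ms), "ms")
print("HTML saving:", transfer_saving_ms(bytes_saved_by_compacting_html, bandwidth), "ms")
```

Changing the assumed bandwidth changes the picture a lot; on a slow modem link the bytes saved matter considerably more than on broadband.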
 
Jim Higson

gzip is command line!
With mod_gzip or mod_gunzip on an Apache server, all your pages are sent
gzipped, completely transparently to most browsers (even IE!), but expanded
for the very few that can't handle Content-Encoding: gzip.

This will reduce page size by about 60%, but should be in addition to, rather
than instead of, compact markup.

Regardless of compression, I try to keep pages below 10k. Like others have
said, it is easy to have images larger than this size. Well compressed 8-bit
PNGs and JPEGs should help here.

http://www.innerjoin.org/apache-compression/howto.html
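
To check what gzip does for a particular page before configuring the server, something like the following sketch can help. It uses Python's standard gzip module rather than mod_gzip itself, and the sample page is made up for illustration, so the saving it reports says nothing about real pages.

```python
import gzip


def gzip_saving(html: bytes, level: int = 9) -> float:
    """Return the fraction of bytes saved by gzip-compressing html."""
    compressed = gzip.compress(html, compresslevel=level)
    return 1 - len(compressed) / len(html)


# Hypothetical sample page; substitute the bytes of a real file to test it.
sample = (
    b"<html><head><title>Example</title></head><body>\n"
    + b"".join(
        b'<p class="item">Paragraph number %d of the page.</p>\n' % i
        for i in range(50)
    )
    + b"</body></html>\n"
)

saving = gzip_saving(sample)
print(f"{len(sample)} bytes before gzip, saving {saving:.0%}")
```

For a file on disk, `gzip_saving(open("index.html", "rb").read())` gives the same figure (the filename is just an example).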
 
Jim Higson

Incidentally, I'd forget about making pages more compressible prior to
gzipping; information theory is not on your side.

I'd also forget about writing a better compression tool than gzip, unless
you are SERIOUSLY into maths. 7zip might give slightly better results in
some cases, but AFAIK no browsers accept it as a Content-Encoding.
 
Jim Higson

Man, gotta stop replying to myself, but...

Check out Perl's HTML::Clean; it already does much of what you are (maybe)
trying to do. Anyone on an Apache server can use it as a filter for dynamic
content, or apply it offline for static pages.

http://www.perl.com/pub/a/2003/04/17/filters.html
 
Errol Smith

This has been discussed here before. The general consensus is that it is a
waste of time. Look to other things first. Image compression: a badly
compressed image will waste far more bandwidth than compressing the HTML
will save. Number of images: 10 images on a page result in 10 round trips
back to the server, an elapsed time of hundreds of milliseconds, perhaps even
a number of seconds. Compressing the HTML might save ten or so milliseconds.

I know, I read the previous posts, but I will not be discouraged, as I
believe in the every-byte-counts theory :)
My tool is intended to cover all bases anyway - it already optimises
JPG, GIF & PNG images. My tool's aim is to be the last step prior to
publishing, just to automatically shave off a few K here & there. It
won't help you if you save your JPGs with 100% quality and use WORD
as your html editor ;)
I note that you don't compress *this* page :) You even have great sequences
of cr/lf in there.

Actually I _do_, but like I said, htmlcrunch is not very good :)

Errol Smith
errol <at> ros (dot) com [period] au
 
Errol Smith

Jim,

Thank you very much for your (multiple) replies!
I am aware of the gzip functionality in webservers/browsers; I am
more interested in html cleaning/optimising (i.e. "compact markup").
This means that browsers and/or servers not supporting those encoding
methods still benefit, plus even with gzip encoding, the resultant
compressed file will still be smaller than if the html had not been
compacted first.
As for the "making more compressible", I know this is a niche topic
and there is probably not much to be gained, but it interests me anyway
:) (I do have some knowledge/experience of compression). I can see
how making the case of all tags consistent would improve compression
(more dictionary matches), but there may be more to it.
Oh, and I am definitely NOT looking to write a new compressor like
gzip etc, only an HTML compacter :) (7zip is very good but not fully
cross platform yet. If there is going to be any new kind of encoding
standard I would expect it to be .bz2, though it may not be suitable
for on-the-fly compression due to its large block size).
Perl's HTML::Clean looks like what I need - I will have to experiment
with it (but first remember how to use Perl! :)
Thanks again, I will keep hunting.

Errol Smith
errol <at> ros (dot) com [period] au
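
Whether consistent tag case actually helps gzip is easy to measure rather than guess. The following is a minimal sketch, not a real HTML parser: the regular expression is a naive assumption that any "<name" or "</name" starts a tag, so it would also touch look-alike text inside scripts or comments. The sample markup is invented, and the sizes printed are whatever gzip produces for that input.

```python
import gzip
import re

# Naive pattern: "<" or "</" followed by an element name.
TAG_NAME = re.compile(r"(</?)([A-Za-z][A-Za-z0-9]*)")


def lowercase_tag_names(html: str) -> str:
    """Lower-case element names only; attributes and text are left alone."""
    return TAG_NAME.sub(lambda m: m.group(1) + m.group(2).lower(), html)


def gzip_size(text: str) -> int:
    """Size in bytes of text after gzip compression at the highest level."""
    return len(gzip.compress(text.encode("utf-8"), compresslevel=9))


# Invented sample with inconsistent tag case.
mixed = (
    "<P>first</P>\n<p>second</p>\n"
    "<Table><TR><td>cell</TD></tr></TABLE>\n"
    "<UL><li>one</LI><Li>two</li></ul>\n"
)
lowered = lowercase_tag_names(mixed)

print("mixed case :", gzip_size(mixed), "bytes gzipped")
print("lower case :", gzip_size(lowered), "bytes gzipped")
```

The uncompressed length does not change, so any difference comes purely from gzip finding more repeated strings.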
 
Jim Higson

I have a better idea of what you're trying to do now. I quite like the idea.
I don't think very well made pages could be shrunk much, but for some guy's
homepage you might be onto something. Some ideas:

* Replacing class and id names with single letter identifiers in the html
and css? Might not save much if the file is gzipped since they're repeated
strings anyway, but might be worth a few bytes. Will also make the code
harder to read, so personally I would avoid it.
* Replacing long URLs to pages on the same site with hrefs to symlinks on
the server, with much smaller names? Static pages only, I'm afraid.
* Lossy PNG compression (google for it!) and conversion of PNGs to indexed
* Stripping comments, lf and cr. I don't like this much because I think you
should be able to look at the html of a site, but it would save a little
space.
* A thumbnail maker that makes the thumbs from lossless versions of the
artwork, not the published jpeg version, so images aren't compressed twice.
I do this on sites I create; my thumbs are a *little* smaller and a *tiny*
bit higher quality because of it.
* Automatic replacing of img tags with objects, where it is more compact, or
with divs and css where the image isn't content. Not sure how you could
tell, mind.
* Decision-tree induction to convert font tags to css. A lot of bad code
could be made smaller this way.
* Moving embedded css into a separate file, where the same rules are used on
several pages.
* Check out advpng; it shrinks PNG images down by a few percent or so.
Ok!
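
The "stripping comments, lf and cr" idea from the list above could be sketched roughly as follows. This is a deliberately naive illustration, not what htmlcrunch or HTML::Clean do: it assumes the page has no pre, textarea, script or style content where whitespace matters, and no conditional comments that must be kept. The sample page is invented.

```python
import gzip
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
WHITESPACE = re.compile(r"\s+")


def compact(html: str) -> str:
    """Remove HTML comments, then collapse each whitespace run to one space."""
    without_comments = COMMENT.sub("", html)
    return WHITESPACE.sub(" ", without_comments).strip()


def sizes(html: str) -> tuple:
    """Return (raw bytes, gzipped bytes) for html."""
    raw = html.encode("utf-8")
    return len(raw), len(gzip.compress(raw, compresslevel=9))


# Invented sample page with comments and generous cr/lf runs.
page = (
    "<html>\r\n\r\n<!-- generated by some editor -->\r\n"
    "<body>\r\n\r\n\r\n    <p>Hello,   world.</p>\r\n\r\n"
    "    <!-- navigation goes here -->\r\n</body>\r\n</html>\r\n"
)

print("original :", sizes(page))
print("compacted:", sizes(compact(page)))
```

Comparing the second number of each pair shows how much of the saving survives gzip for a given page.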
 
Jim Higson

Sam said:
How does it compare to PNGOUT?

I just did a little test with the 9 small images you see at the top of my
client's page here:

http://www.masmodels.com/portfolio

advpng : 57.1k
pngout : 55.7k (97.5% of advpng)

so pngout is *slightly* better

However, I sometimes run my scripts on a Linux/PPC computer, so programs
distributed as i386 binary only are not much use to me.

About pngout - I don't think I'll use it. Personally I don't much like
software I can't modify, and like even less being directed to a 38k HTML
file (plus images), where I am asked to wait in line for a 28k download!
 
Errol Smith

I have a better idea of what you're trying to do now. I quite like the idea.
I don't think very well made pages could be shrunk much, but for some guy's
homepage you might be onto something. Some ideas:

That's the aim. People who hand-optimise their pages probably won't
gain much but this is for lazy people like me :)
* Replacing class and id names with single letter identifiers in the html
and css? Might not save much if the file is gzipped since they're repeated
strings anyway, but might be worth a few bytes. Will also make the code
harder to read, so personally I would avoid.
* Replacing long URLs to pages on the same site with hrefs to symlinks on
the server, with much smaller names? Static pages only I'm afraid.
* Lossy PNG compression (google for it!) and conversion of PNGs to indexed

I hadn't thought of most of these things. The idea of re-coding a
whole site using smaller URLs has tempted me, but that makes the site
harder to use, as people (and search engines) look at the URL to get
an idea of where they are.
* stripping comments, lf and cr. I don't like this much because I think you
should be able to look at the html of a site, but would save a little
space.

This is what my original post was asking about! :) I think it's OK
to do this, as long as you are only interested in saving space. You can
always run htmltidy on the code to make it readable again.
* A thumbnail maker that makes the thumbs from lossless versions of the
artwork, not the published jpeg version so images aren't compressed twice.
I do this on sites I create, my thumbs are a *little* smaller and a *tiny*
bit higher quality because of it
* automatic replacing of img tags with objects, where it is more compact, or
with divs and css where the image isn't content. Not sure how you could
tell, mind.
* Decision-tree induction to convert font tags to css. A lot of bad code
could be made smaller this way
* Moving embedded css into separate file, where the same rules are used on
several pages.

More ideas! I will add them to my "investigate further" pile! Thanks.
* Check out advpng, shrinks PNG images down by few percent or so.

I'm just downloading it and will see how it compares with pngcrush
when I get a bit more time.
Thanks for your many suggestions!



Errol Smith
errol <at> ros (dot) com [period] au
 
Sam Hughes

I'm just downloading it and will see how it compares with pngcrush
when I get a bit more time.

Compare it to PNGOUT, why not, which usually outcompresses pngcrush, YMMV.
 
Errol Smith

I've given HTML::Clean a try and it produces more compact html than
htmlcrunch, BUT it introduced some errors in a couple of test files
that htmlcrunch processed OK, so it's not perfect. (I am not really
surprised, because it has been about 4 years since the last update.)
I will look into it, but Perl is not a language I am familiar with, so
I don't know what I'll be able to achieve. (The author hasn't
responded to my email.)


Errol Smith
errol <at> ros (dot) com [period] au
 
