    Elisp Tutorial: HTML Syntax Coloring Code Block

    Xah Lee, 2007-10

    This page shows an example of writing an Emacs Lisp function that
    processes a block of text to syntax color it with HTML tags. If you
    don't know elisp, first take a gander at Emacs Lisp Basics.

    HTML version with color and links is at:



    I want to write an elisp function such that, when invoked, the block
    of text the cursor is on will have various HTML style tags wrapped
    around it. This is for the purpose of publishing programming language
    code in HTML on the web.


    I write a lot of computer programming tutorials for several computer
    languages. For example: Perl and Python tutorial, Java tutorial, Emacs
    Lisp tutorial, Javascript tutorial. These tutorials often contain
    code snippets, and this code needs to be syntax colored in HTML.

    For example, here's an elisp code snippet:

    (if (< 3 2) (message "yes") )

    Here's what I actually want as raw HTML:

    (<span class="keyword">if</span> (&lt; 3 2) (message <span
    class="string">"yes"</span>) )

    Which should look like this in a web browser:

    (if (< 3 2) (message "yes") )

    There is an emacs package that turns syntax-colored text in emacs
    into HTML form. This is extremely nice. The package is called
    htmlize.el and was written (1997,...,2006) by Hrvoje Niksic, available at

    This program provides you with a few new emacs commands. Primarily, it
    has htmlize-region, htmlize-buffer, and htmlize-file. The region and
    buffer commands output HTML code in a new buffer, and htmlize-file
    takes an input file name and writes its output into a file.

    When I need to include a code snippet in my tutorial, typically I
    write the code in a separate file (e.g. “temp.java”, “temp.py”), run
    it to make sure the code is correct (compile, if necessary), then
    copy the file into the HTML tutorial page, inside a «pre» block. In
    this scheme, the best way for me to use htmlize.el is to run the
    “htmlize-buffer” command on my temp.java, then copy the htmlized
    output and paste that into my HTML tutorial file inside a «pre» block.

    Since many of my tutorials were written haphazardly over the years,
    before I saw the need for syntax coloring, most code already sits
    inside «pre» tags without a temp code file. So, in most cases, what I
    do is select the text inside the «pre» tag, paste it into a temp
    buffer, invoke the right mode for the language (so the text will be
    fontified correctly), run htmlize-buffer, copy the HTML output, then
    paste it back to replace the selected text.

    This process is tedious. A tutorial page will have several code
    blocks. For each, I need to select text, create a buffer, switch
    mode, run htmlize, select again, switch buffer, then paste. Many of
    the steps are not pure push-button operations but involve eye-balling.
    There are a few hundred such pages.

    It would be better if I could place the cursor on a code block in an
    existing HTML page, then press a button, and have emacs magically
    replace the code block with an htmlized version colorized for the code
    block's language. We proceed to write this function.


    For an elisp expert who knows how fontification works in emacs, the
    solution would be to write elisp code that maps the fontification
    info of emacs's strings into HTML tags. This is exactly what
    htmlize.el does. Since it is already written, an elisp expert might
    find the essential code in htmlize.el. (The code is licensed under
    the GPL.)

    Unfortunately, my lisp experience isn't so great. I spent maybe 30
    minutes looking in htmlize.el in the hope of finding a function,
    something like htmlize-str, that is the essence, but wasn't
    successful. I figured it would actually be faster to take the dumb
    and inefficient approach, by writing elisp code that extracts the
    output from htmlize-buffer. Here's the outline of the plan of my
    function:

    * 1. Grab the text inside a <pre class="«lang»">...</pre> tag.
    * 2. Create a new buffer. Paste the code in.
    * 3. Make the new buffer «lang» mode (and fontify it)
    * 4. Call htmlize-buffer
    * 5. Grab the (htmlized) text inside «pre» tag in the htmlize
    created output buffer.
    * 6. Kill the htmlize buffer and my temp buffer.
    * 7. Delete the original text, paste in the new text.

    To achieve the above, I decided on 2 steps. A: Write a function
    “htmlize-string” that takes a string and mode name, and returns the
    htmlized string. B: Write a function “htmlize-block” that does the
    steps of grabbing text and pasting, and calls “htmlize-string” for the
    actual htmlization.

    Here's the code of my htmlize-string function:

    (defun htmlize-string (ccode mn)
      "Take string CCODE and return htmlized code, using mode MN.
    MN is the name of a major mode command, given as a string.
    This function requires htmlize.el by Hrvoje Niksic, 2006."
      (let (cur-buf temp-buf temp-buf2 x1 x2 resultS)
        (setq cur-buf (buffer-name))
        (setq temp-buf "xout-weewee")
        (setq temp-buf2 "*html*") ;; the buffer that htmlize-buffer creates

        ;; put the code in a new buffer, set the mode
        (switch-to-buffer temp-buf)
        (insert ccode)
        (funcall (intern mn))

        (htmlize-buffer temp-buf)
        (kill-buffer temp-buf)
        (switch-to-buffer temp-buf2)

        ;; extract the core code
        (goto-char (point-min))
        (setq x1 (re-search-forward "<pre>"))
        (setq x1 (+ x1 1)) ; skip the newline right after <pre>
        (re-search-forward "</pre>")
        (setq x2 (re-search-backward "</pre>"))
        (setq resultS (buffer-substring-no-properties x1 x2))
        (kill-buffer temp-buf2)

        (switch-to-buffer cur-buf)
        ;; the htmlized string is the return value
        resultS))

    The major part of this code is knowing how to create, switch, and kill
    buffers. Then, how to set a mode. Lastly, how to grab text in a
    buffer.
    The name of the current buffer is given by “buffer-name”. Creating
    or switching to a buffer is done by “switch-to-buffer”. Killing a
    buffer is “kill-buffer”. To activate a mode whose name is stored as a
    string, the code is “(funcall (intern my-mode-name))”: “intern” turns
    the string into the symbol of that name, and “funcall” calls the
    function that symbol names.
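
    A minimal sketch of the “(funcall (intern ...))” idiom, using only
    built-in functions. The name my-major-mode-after is made up for this
    illustration, and with-current-buffer is used here instead of
    switch-to-buffer so that the window display is left alone:

```elisp
(defun my-major-mode-after (text mode-name)
  "Return the `major-mode' in effect after enabling MODE-NAME on TEXT.
TEXT is inserted into a scratch buffer; MODE-NAME is a string such
as \"emacs-lisp-mode\"."
  (let ((buf (generate-new-buffer "my-scratch")))
    (unwind-protect
        (with-current-buffer buf
          (insert text)
          ;; "emacs-lisp-mode" (a string) -> emacs-lisp-mode (a symbol),
          ;; which funcall then calls as a function.
          (funcall (intern mode-name))
          major-mode)
      ;; always clean up the scratch buffer, even on error
      (kill-buffer buf))))

(my-major-mode-after "(+ 1 2)" "emacs-lisp-mode") ; returns emacs-lisp-mode
```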

    The grabbing of text is done by locating the desired beginning and
    ending locations using the re-search functions, and using
    buffer-substring-no-properties to actually extract the string.
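
    The same search-then-extract pattern can be tried in isolation. This
    is only a sketch: my-text-inside-pre is a hypothetical helper, and it
    uses match-beginning to find the start of the closing tag rather than
    searching backward as htmlize-string does:

```elisp
(defun my-text-inside-pre (html)
  "Return the text between the first <pre> and the next </pre> in HTML."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (let (start end)
      (re-search-forward "<pre>")       ; point is now just after the tag
      (setq start (point))
      (re-search-forward "</pre>")
      (setq end (match-beginning 0))    ; start of the closing tag
      (buffer-substring-no-properties start end))))

(my-text-inside-pre "<body><pre>(message \"hi\")</pre></body>")
;; returns "(message \"hi\")"
```

    Both searches signal an error if the tag is not found, which is
    acceptable for a personal command but worth knowing about.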

    Here, note the “no-properties” in “buffer-substring-no-properties”.
    An emacs string can carry extra information called text properties,
    which includes the fontification (face) information.
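
    A small sketch of the difference, with a face attached by hand via
    propertize instead of by font-lock. The function name
    my-compare-substrings is invented for this example:

```elisp
(defun my-compare-substrings ()
  "Return the face found on a substring taken with and without properties."
  (with-temp-buffer
    ;; insert "yes" carrying a face, much as font-lock marks a string
    (insert (propertize "yes" 'face 'font-lock-string-face))
    (let ((with-props (buffer-substring (point-min) (point-max)))
          (plain (buffer-substring-no-properties (point-min) (point-max))))
      (list (get-text-property 0 'face with-props)
            (get-text-property 0 'face plain)))))

(my-compare-substrings) ; returns (font-lock-string-face nil)
```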

    Reference: Elisp Manual: Buffers.

    Reference: Elisp Manual: Text-Properties.

    Here's the code of my htmlize-block function:

    (defun htmlize-block ()
      "Replace the region enclosed by <pre> tag to htmlized code.
    For example, if the cursor is somewhere inside the tag:

    <pre class=\"code\">
    codeXYZ...
    </pre>

    after calling, the “codeXYZ...” block of text will be htmlized.
    That is, wrapped with many <span> tags.

    The opening tag must be of the form <pre class=\"lang-str\">.
    The “lang-str” determines what emacs mode is used to colorize
    the code.
    This function requires htmlize.el by Hrvoje Niksic."
      (interactive)
      (let (mycode tag-begin styclass code-begin code-end tag-end mymode)
        ;; locate the enclosing <pre class="..."> tag and its contents
        (setq tag-begin (re-search-backward "<pre class=\"\\([A-Za-z-]+\\)\""))
        (setq styclass (match-string 1))
        (setq code-begin (re-search-forward ">"))
        (re-search-forward "</pre>")
        (setq code-end (re-search-backward "<"))
        (setq tag-end (re-search-forward "</pre>"))
        (setq mycode (buffer-substring-no-properties code-begin code-end))
        ;; map the class name to a mode name
        (cond
         ((equal styclass "elisp") (setq mymode "emacs-lisp-mode"))
         ((equal styclass "perl") (setq mymode "cperl-mode"))
         ((equal styclass "python") (setq mymode "python-mode"))
         ((equal styclass "java") (setq mymode "java-mode"))
         ((equal styclass "html") (setq mymode "html-mode"))
         ((equal styclass "haskell") (setq mymode "haskell-mode"))
         (t (error "No mode known for class \"%s\"" styclass)))
        ;; replace the original text with the htmlized version
        (delete-region code-begin code-end)
        (goto-char code-begin)
        (insert (htmlize-string mycode mymode))))

    The steps of this function are to grab the text inside a «pre» block,
    call htmlize-string, then insert the result in place of the original
    text.

    Originally, I wrote the code to grab the text inside plain
    “<pre>...</pre>” tags, then use some heuristics to determine what
    language it is, then call htmlize-string with the mode name passed to
    it. However, since my HTML pages already have the language
    information in the form of “<pre class="«lang»">...</pre>” (for CSS
    reasons), I now search for text by that form, and use the «lang» part
    to determine a mode.
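
    The class-to-mode lookup can also be kept in a table. The sketch
    below is an alternative layout, not the code I actually use; the
    names my-class-mode-alist and my-mode-name-for-tag are hypothetical,
    and the function works on a string rather than on the buffer:

```elisp
(defvar my-class-mode-alist
  '(("elisp" . "emacs-lisp-mode")
    ("perl" . "cperl-mode")
    ("python" . "python-mode")
    ("java" . "java-mode")
    ("html" . "html-mode")
    ("haskell" . "haskell-mode"))
  "Map a <pre> class name to the name of a major mode command.")

(defun my-mode-name-for-tag (tag)
  "Return the mode name for the class in the opening <pre> TAG, or nil."
  (when (string-match "<pre class=\"\\([A-Za-z-]+\\)\"" tag)
    (cdr (assoc (match-string 1 tag) my-class-mode-alist))))

(my-mode-name-for-tag "<pre class=\"perl\">") ; returns "cperl-mode"
(my-mode-name-for-tag "<pre>")                ; returns nil
```

    With a table, supporting another language means adding one entry
    instead of another branch.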

    Emacs is beautiful.


    The story given above is slightly simplified. For example, when I
    began my language notes and commentaries, they were not planned to be
    a systematic or sizable tutorial. As the pages grew, more quality was
    added in the editorial process. So, plain un-colored code inside
    «pre» started to have “language comment” strings colorized (e.g.
    “<span class="cmt">#...</span>”), by using simple elisp code that
    wraps a tag around them, with this function mapped to a shortcut key
    for easy execution. As pages and languages grew, I found that
    colorizing comments wasn't enough, so I started to look for a
    syntax-coloring HTML solution. There are solutions in Perl, Python,
    PHP, but I find the emacs solution best suits my needs, in particular
    because it integrates with emacs's interactive nature, and my writing
    work is done in an accumulative, editorial process.
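
    A tag-wrapping command of that kind might look like the sketch
    below. It is a reconstruction for illustration only; the name
    my-wrap-region-as-comment is made up, and it assumes START is not
    after END (which “(interactive "r")” guarantees):

```elisp
(defun my-wrap-region-as-comment (start end)
  "Wrap the text between START and END in a <span class=\"cmt\"> tag."
  (interactive "r")
  (save-excursion
    ;; insert the closing tag first, so START is still valid afterwards
    (goto-char end)
    (insert "</span>")
    (goto-char start)
    (insert "<span class=\"cmt\">")))

(with-temp-buffer
  (insert "x = 1 # note")
  (my-wrap-region-as-comment 7 (point-max))
  (buffer-string))
;; returns "x = 1 <span class=\"cmt\"># note</span>"
```

    Bound to a key with global-set-key, it works on whatever text is
    selected.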

    In the beginning I used htmlize-region and htmlize-buffer as they are
    for new code. Note that this is still a laborious process. Gradually I
    needed to colorize my old code. The problem is that many blocks
    already contain my own «span class="cmt"» tags, and strings common in
    computer languages such as “<=” have already been transformed into the
    required HTML encoding “&lt;=”. So, the elisp code first had to
    “un-htmlize” these in my htmlize-block code. But once all my existing
    code had been newly colorized, the part of the code that transforms
    strings for un-htmlize was no longer necessary, so it was taken out
    of htmlize-block, which returned to a cleaner state.

    Also, htmlize-block went thru many revisions over the year. At one
    time in the recent past, I had one code wrapper for each language. For
    example, I had htmlize-me-perl, htmlize-me-python, htmlize-me-java,
    etc. At that time, the need for unification into a single coherent
    wrapper had not materialized. In general, it is my experience, in
    particular in writing elisp customization for emacs, that tweaking
    code periodically thru the year is practical, because it adapts to
    the constant changes of requirements, environment, and work process.
    For example, eventually I might write my own htmlize.el, if I happen
    to need more flexibility, or if my elisp experience grows enough to
    make the job relatively easy.
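
    The “un-htmlize” step mentioned above could be sketched roughly as
    follows. This is not the code that was removed from htmlize-block,
    only an illustration under the assumption that just «span» tags and
    the three basic entities need undoing; my-unhtmlize-string is a
    made-up name:

```elisp
(defun my-unhtmlize-string (s)
  "Return S with <span> tags removed and basic HTML entities decoded."
  (let ((result (replace-regexp-in-string "</?span[^>]*>" "" s)))
    ;; Decode &amp; last, so that "&amp;lt;" becomes "&lt;" and not "<".
    (dolist (pair '(("&lt;" . "<") ("&gt;" . ">") ("&amp;" . "&")) result)
      (setq result (replace-regexp-in-string
                    (regexp-quote (car pair)) (cdr pair) result t t)))))

(my-unhtmlize-string "<span class=\"cmt\"># a &lt;= b</span>")
;; returns "# a <= b"
```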

    Also note: a whole-sale solution is to write a program, in say,
    Python, that processes HTML files and replaces the proper sections
    with htmlized strings. This is perhaps more efficient if all the
    existing HTML files are in some uniform format. However, I need to
    work on my tutorials on a case-by-case basis, in part because some
    pages contain multiple languages or contain pseudo-code that I do not
    wish colorized. (For example, some pages contain code in the
    Mathematica language. Mathematica code is normally written in
    Mathematica's mathematical-typesetting-capable “front-end” IDE called
    “Notebook” and is not “syntax-colored” as such.)


    ∑ http://xahlee.org/
    Xah Lee, Oct 18, 2007
