emacs lisp text processing example (html5 figure/figcaption)

Xah Lee · Jul 4, 2011

OMG, emacs lisp beats perl/python again!

Hiya all, another little emacs lisp tutorial from the tiny Xah's Edu
Corner.

〈Emacs Lisp: Processing HTML: Transform Tags to HTML5 “figure” and
“figcaption” Tags〉
xahlee.org/emacs/elisp_batch_html5_tag_transform.html

plain text version follows.

------------------------------------------

Emacs Lisp: Processing HTML: Transform Tags to HTML5 “figure” and
“figcaption” Tags

Xah Lee, 2011-07-03

Another triumph of using elisp for text processing over perl/python.

----------------------------
The Problem

--------------
Summary

I want to batch transform the image tags in 5 thousand html files to use
HTML5's new “figure” and “figcaption” tags.

I want to be able to view each change interactively, while optionally
giving it a “go ahead” to do the whole job in batch.

Interactive eye-ball verification on many cases lets me be reasonably
sure the transform is done correctly. Yet i don't want to spend days
to think/write/test a mathematically correct program that otherwise
can be finished in 30 min with human interaction.

--------------
Detail

HTML5 has the following new tags: “figure” and “figcaption”. They are
used like this:

<figure>
<img src="cat.jpg" alt="my cat" width="167" height="106">
<figcaption>my cat!</figcaption>
</figure>

(For detail, see: HTML5 “figure” ＆ “figurecaption” Tags Browser
Support)

On my website, i used a similar structure. They look like this:

<div class="img">
<img src="cat.jpg" alt="my cat" width="167" height="106">
<p class="cpt">my cat!</p>
</div>

So, i want to replace them with the HTML5's new tags. This can be done
with a regex. Here's the “find” regex:

<div class="img">
?<img src="\([^.]+?\)\.jpg" alt="\([^"]+?\)" width="\([0-9]+?\)" height="\([0-9]+?\)">?
<p class="cpt">\([^<]+?\)</p>
?</div>

Here's the replacement string:

<figure>
<img src="\1.jpg" alt="\2" width="\3" height="\4">
<figcaption>\5</figcaption>
</figure>
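
For readers who want to try the simple case outside Emacs, here is a minimal Python sketch of the same find/replace. It assumes the pattern carries over to Python's re dialect as shown (plain parentheses for the five groups, an optional line break between tags). The names FIND, REPLACE and sample are made up for this illustration.

```python
import re

# Python-dialect version of the simple "find" pattern.
# Assumption: an optional newline may sit between the tags.
FIND = re.compile(
    r'<div class="img">\n?'
    r'<img src="([^.]+?)\.jpg" alt="([^"]+?)" width="([0-9]+?)" height="([0-9]+?)">\n?'
    r'<p class="cpt">([^<]+?)</p>\n?'
    r'</div>'
)

# Replacement template; \1..\5 refer to the five captured groups.
REPLACE = (
    '<figure>\n'
    '<img src="\\1.jpg" alt="\\2" width="\\3" height="\\4">\n'
    '<figcaption>\\5</figcaption>\n'
    '</figure>'
)

sample = (
    '<div class="img">\n'
    '<img src="cat.jpg" alt="my cat" width="167" height="106">\n'
    '<p class="cpt">my cat!</p>\n'
    '</div>'
)

result = FIND.sub(REPLACE, sample)
print(result)
```

This only covers the one-image, jpg, plain-text-caption case.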

Then, you can use “find-file” and dired's “dired-do-query-replace-regexp”
to work on your 5 thousand pages. Nice. (See: Emacs:
Interactively Find ＆ Replace String Patterns on Multiple Files.)

However, the problem here is more complicated. The image file may be
jpg or png or gif. Also, there may be more than one image per group.
Also, the caption part may also contain complicated html. Here's some
examples:

<div class="img">
<img src="cat1.jpg" alt="my cat" width="200" height="200">
<img src="cat2.jpg" alt="my cat" width="200" height="200">
<p class="cpt">my 2 cats</p>
</div>

<div class="img">
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
<p class="cpt">jamie's cat! Her blog is <a href="http://example.com/jamie/">http://example.com/jamie/</a></p>
</div>

So, a solution by regex is out.

----------------------------
Solution

The solution is pretty simple. Here's the major steps:

Use “find-lisp-find-files” to traverse a dir.
For each file, open it.
Search for the string <div class="img">
Use “sgml-skip-tag-forward” to jump to its closing tag.
Save the begin/end positions of these tags.
Ask user if she wants to replace. If so, do it. (using “delete-region”
and “insert”)
Repeat.
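
The steps above can also be sketched in Python, for comparison. This is a rough sketch and not a port of the elisp: there is no interactive prompt, and the matching closing tag is found by counting nested div tags instead of calling “sgml-skip-tag-forward”. The function name retag_img_divs is hypothetical, and the sketch assumes every "<div" in the input really is a tag (none hiding inside comments or scripts).

```python
import re

OPEN_MARK = '<div class="img">'
DIV_TAG = re.compile(r'<div\b[^>]*>|</div>')  # any opening or closing div tag


def retag_img_divs(html):
    """Swap each <div class="img">...</div> wrapper for <figure>...</figure>.

    Only the two wrapper tags are rewritten; the text between them is
    copied verbatim, so whitespace and formatting are left alone.
    Returns (new_html, number_of_replacements).
    """
    out = []
    pos = 0
    count = 0
    while True:
        start = html.find(OPEN_MARK, pos)
        if start == -1:
            break
        inner_start = start + len(OPEN_MARK)
        # Walk over the following div tags, tracking nesting depth, until
        # the closing tag that matches our opening tag is reached.
        depth = 1
        close = None
        for m in DIV_TAG.finditer(html, inner_start):
            depth += -1 if m.group() == '</div>' else 1
            if depth == 0:
                close = m
                break
        if close is None:  # unbalanced input: leave the rest untouched
            break
        out.append(html[pos:start])
        out.append('<figure>')
        out.append(html[inner_start:close.start()])
        out.append('</figure>')
        pos = close.end()
        count += 1
    out.append(html[pos:])
    return ''.join(out), count
```

The returned count makes it easy to report how many wrappers were changed per file.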

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-07-03
;; replace image tags to use html5's “figure” and “figcaption” tags.

;; Example. This:
;; <div class="img">…</div>
;; should become this
;; <figure>…</figure>

;; do this for all files in a dir.

;; rough steps:
;; find the <div class="img">
;; use sgml-skip-tag-forward to move to the ending tag.
;; save their positions.

(require 'sgml-mode) ; provides sgml-skip-tag-forward

(defun my-process-file (fpath)
  "process the file at fullpath FPATH ..."
  (let (mybuff p1 p2 p3 p4)
    (setq mybuff (find-file fpath))

    (widen)
    (goto-char (point-min)) ;; in case buffer already open

    (while (search-forward "<div class=\"img\">" nil t)
      (setq p2 (point))
      (backward-char 17) ; beginning of “div” tag
      (setq p1 (point))

      (forward-char 1)
      (sgml-skip-tag-forward 1) ; move to the closing tag
      (setq p4 (point))
      (backward-char 6) ; beginning of the closing div tag
      (setq p3 (point))
      (narrow-to-region p1 p4)

      (when (y-or-n-p "replace?")
        ;; replace the closing tag first, so that p1 and p2 stay valid
        (delete-region p3 p4)
        (goto-char p3)
        (insert "</figure>")

        (delete-region p1 p2)
        (goto-char p1)
        (insert "<figure>"))
      ;; always widen, so that answering "n" does not leave the next
      ;; search confined to the narrowed region
      (widen))

    (when (not (buffer-modified-p mybuff)) (kill-buffer mybuff))))

(require 'find-lisp)

(let (outputBuffer)
  (setq outputBuffer "*xah img/figure replace output*")
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file
          (find-lisp-find-files "~/web/xahlee_org/emacs/" "\\.html$"))
    (princ "Done deal!")))

Seems pretty simple right?

The “p1” and “p2” variables are the positions of start/end of <div
class="img">. The “p3” and “p4” are the start/end of its closing tag
</div>.

We also used a little trick with “widen” and “narrow-to-region”. It
lets me see just the part that i'm interested in. It narrows to the
beginning/end of the div.img. This makes eye-balling a bit easier.

The real time-saver is the “sgml-skip-tag-forward” function from
“html-mode”. Without that, one'd have to write a mini-parser to deal with
html's nested ways to be able to locate the proper ending tag.

Using the above code, i can comfortably eye-ball and press “y” at the
rate of about 5 per second. That makes 300 replacements per minute. I
have 5000+ files. If we presume there are 6k replacements to be made,
then at 5 per second that means 20 minutes sitting there pressing “y”.
Quite tiresome.

So, now, the next step is simply to remove the asking (y-or-n-p
"replace?"). Or, if i'm absolutely paranoid, i can make emacs write
into a log buffer for every replacement it makes (together with the
file path). When the batch replacement is done (probably under 3
minutes), i can simply scan thru the log to see if any replacement
went wrong. For how to do that, see: Emacs Lisp: Multi-Pair String
Replacement with Report.
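
The logging idea can be sketched like this in Python. It is only an illustration under stated assumptions: batch_transform and its (path, count) log format are hypothetical, the files are assumed to be UTF-8, and transform stands for any function that returns the new text plus the number of replacements it made.

```python
from pathlib import Path


def batch_transform(root, transform, suffix='.html'):
    """Apply transform(text) -> (new_text, count) to each file under root.

    Files where count is 0 are not rewritten.
    Returns a log: a list of (path, count), one entry per changed file.
    """
    log = []
    for path in sorted(Path(root).rglob('*' + suffix)):
        if not path.is_file():
            continue
        # newline='' turns off newline translation, so the original
        # line endings are read and written back unchanged.
        with open(path, encoding='utf-8', newline='') as f:
            text = f.read()
        new_text, count = transform(text)
        if count:
            with open(path, 'w', encoding='utf-8', newline='') as f:
                f.write(new_text)
            log.append((str(path), count))
    return log
```

After the run, the returned list can be printed or skimmed to spot files with an unexpected number of replacements.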

But what about replacing <p class="cpt">…</p> with
<figcaption>…</figcaption>?

I simply copied and pasted the above code into a new file, and made
changes in just 4 places. So, replacing the figcaption part is treated as a
separate batch job. Of course, one could spend an extra hour or so to
make the code do them both in one pass, but is that one extra hour of
thinking ＆ coding worthwhile for this one-time job?

I ♥ Emacs, do you?

---------------------------------

PS perl and python solution welcome. I haven't looked at perl or
python's html parser libs for 5+ years.

Though, 2 little requirements:

1. it must be correct, of course. Cannot tolerate the possibility
that maybe one out of a thousand replacements introduces a
mismatched tag. (but you can assume that all the input html files are
w3c valid)

2. it must not change the formatting of the html pages. i.e. adding/removing
spaces or tabs.

Xah

S.Mandl · Jul 4, 2011

Nice. I guess that XSLT would be another (the official) approach for
such a task.
Is there an XSLT-engine for Emacs?

-- Stefan

Ian Kelly · Jul 5, 2011

So, a solution by regex is out.

Actually, none of the complications you listed appear to exclude
regexes. Here's a possible (untested) solution:

<div class="img">
((?:\s*<img src="[^.]+\.(?:jpg|png|gif)" alt="[^"]+" width="[0-9]+"
height="[0-9]+">)+)
\s*<p class="cpt">((?:[^<]|<(?!/p>))+)</p>
\s*</div>

and corresponding replacement string:

<figure>
\1
<figcaption>\2</figcaption>
</figure>

I don't know what dialect Emacs uses for regexes; the above is the
Python re dialect. I assume it is translatable. If not, then the
above should at least work with other editors, such as Komodo's
"Find/Replace in Files" command. I kept the line breaks here for
readability, but for completeness they should be stripped out of the
final regex.

The possibility of nested HTML in the caption is allowed for by using
a negative look-ahead assertion to accept any tag except a closing
</p>. It would break if you had nested <p> tags, but then that would
be invalid html anyway.
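
To try this in Python, something like the following sketch should do. Two details are assumptions and not part of the regex as posted: a single space is kept between the width and height attributes where the pattern was wrapped, and the replacement string puts \1 directly after <figure>, since group 1 already begins with the whitespace that precedes the first img tag. The names PATTERN, REPLACEMENT and convert are made up for the illustration.

```python
import re

# The pattern from above, joined into one regex.
PATTERN = re.compile(
    r'<div class="img">'
    r'((?:\s*<img src="[^.]+\.(?:jpg|png|gif)" alt="[^"]+" width="[0-9]+" height="[0-9]+">)+)'
    r'\s*<p class="cpt">((?:[^<]|<(?!/p>))+)</p>'
    r'\s*</div>'
)

# \1 is the run of img tags, \2 is the caption (nested tags allowed).
REPLACEMENT = '<figure>\\1\n<figcaption>\\2</figcaption>\n</figure>'


def convert(html):
    """Return html with every matching div.img block rewritten."""
    return PATTERN.sub(REPLACEMENT, html)


two_cats = (
    '<div class="img">\n'
    '<img src="cat1.jpg" alt="my cat" width="200" height="200">\n'
    '<img src="cat2.jpg" alt="my cat" width="200" height="200">\n'
    '<p class="cpt">my 2 cats</p>\n'
    '</div>'
)
print(convert(two_cats))
```

Note that the img tags must have exactly the attribute order shown, or the block is skipped.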

Cheers,
Ian

Xah Lee · Jul 5, 2011

that's fantastic. Thanks! I'll try it out.

Xah

Xah Lee · Jul 5, 2011

Nice. I guess that XSLT would be another (the official) approach for
such a task.
Is there an XSLT-engine for Emacs?

-- Stefan

haven't used XSLT, and don't know if there's one in emacs...

it'd be nice if someone actually gave an example...

Xah

Xah Lee · Jul 5, 2011

emacs regex supports shygroup (the 「(?:…)」) but it doesn't support the
negative assertion 「?!…」 though.

but in any case, i can't see how this part would work
<p class="cpt">((?:[^<]|<(?!/p>))+)</p>

?

Xah

S.Mandl · Jul 5, 2011

haven't used XSLT, and don't know if there's one in emacs...

it'd be nice if someone actually give a example...

Hi Xah, actually I have to correct myself. HTML is not XML. If it
were, you could use a stylesheet like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="p[@class='cpt']">
<figcaption>
<xsl:value-of select="."/>
</figcaption>
</xsl:template>

<xsl:template match="div[@class='img']">
<figure>
<xsl:apply-templates select="@*|node()"/>
</figure>
</xsl:template>

<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>

</xsl:stylesheet>

which applied to this document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<doc>
<h1>Just having fun</h1>with all the
<div class="img">
<img src="cat1.jpg" alt="my cat" width="200" height="200"/>
<img src="cat2.jpg" alt="my cat" width="200" height="200"/>
<p class="cpt">my 2 cats</p>
</div>
cats here:
<h1>Just fooling around</h1>
<div class="img">
<p class="cpt">jamie's cat! Her blog is <a href="http://example.com/jamie/">http://example.com/jamie/</a></p>
</div>
</doc>

would yield:

<?xml version="1.0"?>
<doc>
<h1>Just having fun</h1>with all the
<figure class="img">
<img src="cat1.jpg" alt="my cat" width="200" height="200"/>
<img src="cat2.jpg" alt="my cat" width="200" height="200"/>
<figcaption>my 2 cats</figcaption>
</figure>
cats here:
<h1>Just fooling around</h1>
<figure class="img">
<figcaption>jamie's cat! Her blog is http://example.com/jamie/</figcaption>
</figure>
</doc>
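
For well-formed XML input, roughly the same renaming can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration and not an XSLT engine; the function name to_html5_figures is hypothetical. Like the stylesheet, it keeps the attributes on the figure element and reduces the caption to its text content. Writing the tree back out would not necessarily preserve the original whitespace or attribute quoting.

```python
import xml.etree.ElementTree as ET


def to_html5_figures(root):
    """Rename div.img to figure and p.cpt to figcaption, in place."""
    # list() takes a snapshot, so the tree can be edited while looping.
    for elem in list(root.iter()):
        if elem.tag == 'div' and elem.get('class') == 'img':
            elem.tag = 'figure'  # attributes are kept
        elif elem.tag == 'p' and elem.get('class') == 'cpt':
            text = ''.join(elem.itertext())  # text only, nested tags dropped
            tail = elem.tail
            elem.clear()  # removes children, attributes, text and tail
            elem.tag = 'figcaption'
            elem.text = text
            elem.tail = tail
    return root


doc = ET.fromstring(
    '<doc><div class="img">'
    '<img src="cat1.jpg" alt="my cat" width="200" height="200"/>'
    '<p class="cpt">my 2 cats, see <a href="http://example.com/jamie/">here</a></p>'
    '</div></doc>'
)
to_html5_figures(doc)
print(ET.tostring(doc, encoding='unicode'))
```

The parser rejects input such as an img tag without the trailing slash, which is the same limitation noted for the stylesheet.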

But well, as you don't have XML as input ... there really was no point
to my remark.

Best,
Stefan

Ian Kelly · Jul 5, 2011

but in anycase, i can't see how this part would work
<p class="cpt">((?:[^<]|<(?!/p>))+)</p>

It's not that different from the pattern 「alt="[^"]+"」 earlier in the
regex. The capture group accepts one or more characters that either
aren't '<', or that are '<' but are not immediately followed by '/p>'.
Thus it stops capturing when it sees exactly '</p>' without consuming
the '<'. Using my regex with the example that you posted earlier
demonstrates that it works:
.... <img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
.... <p class="cpt">jamie's cat! Her blog is <a href="http://example.com/
<figure>
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
<figcaption>jamie's cat! Her blog is <a href="http://example.com/
jamie/">http://example.com/jamie/</a></figcaption>
</figure>
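
The stopping behavior described above can be checked in isolation with a small Python sketch. The input string is made up for the purpose: it is the caption from the earlier example followed by a second, unrelated paragraph, to show that the group ends at the first closing p tag.

```python
import re

# Just the caption part of the regex: one or more characters that are
# either not '<', or a '<' that does not start '</p>'.
CAPTION = re.compile(r'<p class="cpt">((?:[^<]|<(?!/p>))+)</p>')

html = (
    '<p class="cpt">jamie\'s cat! Her blog is '
    '<a href="http://example.com/jamie/">http://example.com/jamie/</a></p>'
    '<p>next paragraph</p>'
)

match = CAPTION.search(html)
caption = match.group(1)
print(caption)
```

The nested a tag and its closing tag are captured, while the text of the second paragraph is not.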

Cheers,
Ian
