Tough Regular Expression problem

Bryan · Nov 8, 2004

Hi All:

I'm trying to find the right Regexp string to remove empty SPAN tags
from an HTML string.

Say I have a string like so, and I want to remove the empty span tags:

This is my text

A simple expression like this /(.*)?<\/SPAN>/gi will give me the
text between the two span tags, which I can then use in a replace
statement.

This gets much more complicated when we have nested tags, however.
For example:

one two three four five

What I really want after the replace statement is this:

one two three four five

I'm having trouble crafting the perfect expression for this. I can't
seem to get my head around the right solution to handle the greedy vs
non-greedy thing, and not eliminate the wrong closing tag.

Is this even possible with straight expressions?

Thanks in advance for any help you can provide!

Bryan

J. J. Cale · Nov 10, 2004

Bryan said:
Hi All:

I'm trying to find the right Regexp string to remove empty SPAN tags
from an HTML string.

If you need to remove the element, try the DOM,
and specifically the childNodes collection.

This gets much more complicated when we have nested tags, however.
For example:
one two three four five

The containing element is a node of nodeType element (obj.nodeType == 1).
First you need a reference to the containing span. Either find it via the
DOM tree or give it a specific id, <span id="anId">, and use
var oRef = document.getElementById('anId');
or whatever you wish to support.
'one' is a text node (type 3): oRef.childNodes[0] or oRef.firstChild.
oRef.childNodes[0].nodeValue is 'one'.
oRef.childNodes[1] is the next span element (type 1), and
oRef.childNodes[1].firstChild is the text node containing 'two'.
From here there are a number of ways to deal with this.

What I really want after the replace statement is this:
one two three four five

Create a new text node, insert it before the span
you want to delete, and delete the span.
Or clone the spanToDelete.firstChild node, insert it
before the span to delete, and delete the span.
Or copy the span.firstChild.nodeValue, delete the span,
and append the copied text to the firstSpan.firstChild.nodeValue.
There are other possibilities as well.
Google for DOM Level 2 to see how to do these things correctly.
Hope this helps
Jimbo
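
A minimal sketch of the "insert before the span, then delete the span" idea from the post above. It moves the span's existing children instead of cloning them, which is a small variation on what Jimbo describes. unwrapElement is a hypothetical helper name, and only insertBefore, removeChild, firstChild and parentNode are used.

```javascript
// Sketch only: replace an element by its own child nodes.
// unwrapElement is a hypothetical name, not part of any library.
function unwrapElement(el) {
  var parent = el.parentNode;
  if (!parent) {
    return; // detached node: nothing to unwrap into
  }
  // insertBefore() moves a node, so each call shortens el's child list
  while (el.firstChild) {
    parent.insertBefore(el.firstChild, el);
  }
  parent.removeChild(el); // el is empty at this point
}

// Possible use in a browser, with the id from the post above:
//   var oRef = document.getElementById('anId');
//   unwrapElement(oRef.childNodes[1]);
```

After unwrapping, the moved text ends up as separate, adjacent text nodes in the parent. That is harmless for display, but code that relies on childNodes indexes would need to account for it.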

Bryan · Nov 10, 2004

J. J. Cale wrote...

if you need to remove the element try the DOM
and specifically the childNodes collection

Huh. That's an interesting idea. A little more complicated than a
regexp replace, but it should work. If I can come up with something
that's cross-browser, I might be able to use that approach.

Thanks for the idea.

Thomas 'PointedEars' Lahn · Dec 12, 2004

Bryan said:
[...]
A simple expression like this /(.*)?<\/SPAN>/gi will give me the
text between the two span tags, which I can then use in a replace
statement.

This gets much more complicated when we have nested tags, however.
For example:

one two three four five

What I really want after the replace statement is this:

one two three four five

I'm having trouble crafting the perfect expression for this. I can't
seem to get my head around the right solution to handle the greedy vs
non-greedy thing, and not eliminate the wrong closing tag.

Is this even possible with straight expressions?

No, it is not, by design; or let us say it is not generally possible --
given enough constraints (such as that `span' elements may not nest,
contrary to what the HTML specifications allow), it may be possible (which
is why removeTags() exists in my JSX:string.js, BTW).
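
To illustrate the constrained case, here is a minimal sketch that assumes "empty" means a span without attributes and that such spans never nest. stripFlatSpans is a hypothetical name; this is not the removeTags() function mentioned above.

```javascript
// Sketch only: valid under the assumption that attribute-less
// <span> elements never contain other <span> elements.
// stripFlatSpans is a hypothetical name.
function stripFlatSpans(html) {
  // [\s\S]*? is non-greedy: it stops at the FIRST closing tag it meets.
  // A greedy [\s\S]* would run on to the LAST closing tag in the string.
  return html.replace(/<span\s*>([\s\S]*?)<\/span\s*>/gi, "$1");
}
```

On flat input such as '<span>a</span> b <SPAN>c</SPAN>' this yields 'a b c'. Once the constraint is violated, it goes wrong in the way described in the question: for '<span>one <span class="x">two</span> three</span>' the outer opening tag is paired with the inner closing tag, and the result is 'one <span class="x">two three</span>'.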

AIUI, Regular Expressions require either a DFA or an NFA or both of them
to be matched against a text (that said, know that because ECMAScript
implementations like JavaScript and JScript support PCRE alternation,
they must be using either an NFA or a combination of DFA and NFA to
match RegExps). However, to parse arbitrary occurrences of open and
matching close tags, i.e. to recognize a program in a (deterministic)
context-free language, you require a (N)PDA (which could be implemented
as a markup parser that builds a parse tree, which indeed is done in common
HTML UAs) [1].
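
The stack requirement can be sketched without a full parser. The following is an illustration only: a regular expression finds the span tags, and an explicit stack decides which closing tag belongs to which opening tag. It assumes that "empty" means "has no attributes", that the span tags are balanced, and that no attribute value contains a '>' character. unwrapBareSpans is a hypothetical name.

```javascript
// Sketch only: the regex tokenizes, the stack does the matching.
// unwrapBareSpans is a hypothetical name.
function unwrapBareSpans(html) {
  var tagPattern = /<(\/?)span\b([^>]*)>/gi;
  var stack = [];  // one flag per open span: true = drop this pair
  var out = "";
  var last = 0;
  var m;
  while ((m = tagPattern.exec(html)) !== null) {
    out += html.slice(last, m.index);  // copy text between tags unchanged
    last = tagPattern.lastIndex;
    if (m[1] === "/") {
      // closing tag: it belongs to the most recently opened span
      var drop = stack.pop();
      if (!drop) {
        out += m[0];  // also keeps a stray closing tag (stack was empty)
      }
    } else {
      var bare = m[2].replace(/\s+/g, "") === "";
      stack.push(bare);
      if (!bare) {
        out += m[0];  // spans with attributes are kept
      }
    }
  }
  return out + html.slice(last);
}
```

With '<span>one <span class="x">two</span> three</span>' this returns 'one <span class="x">two</span> three', so the closing tag of the span that carries attributes survives.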

See Jeffrey E. F. Friedl, Mastering Regular Expressions, chapter 4,
section 'Multi-Character "Quotes"' pp., available online at
<http://www.oreilly.com/catalog/regex/chapter/ch04.html> for
further information and possible solutions.

PointedEars
___________
[1] It has been a while since my lectures in automata theory, please CMIIW.
