Alex said on 21/03/2006 1:49 PM AEST:
Hello.
First, with AJAX I will get a remote web page into a string. Thus, the
string will contain HTML tags and such. I will need to extract the inner
text from one <span> for which I know the ID.
Is it possible to do something like "string variable".getElementById()
somehow?
The first question is why are you sending more HTML to the client than
is necessary? But supposing you have a good reason for that...
There are two ways: one is to use a regular expression only, the other
is to use a RegExp in concert with innerHTML and getElementById. Which
is best is up to you.
The first method is kinda quick 'n dirty but may suit. The second
method is a bit more general and may be better where you want to access
multiple elements, but it still has its failings. It processes the
HTML, creates a new div element, sets its style.display property to
'none', injects the processed HTML as the div's innerHTML, then uses
getElementById to get the element and its content.
You may not need all of the following steps, depending on where you
sourced the HTML from:
1. Strip stuff outside body - necessary, must do else will have
invalid HTML.
2. Remove script tags & content - necessary to cut down
on bulk and stop script executing when added to document
or from interfering with other scripts
3. Remove img tags - no point in downloading images
4. Replace onload attribute with onclick to stop script
executing onload - may not be necessary; note that if the text
'onload' appears in the document text it will be altered too.
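As a rough sketch, the four steps above could be collected into one
function that takes the HTML string and returns the cleaned-up markup.
The name cleanHTML is made up for illustration, and the regular
expressions assume reasonably well-formed markup:

```javascript
// Hypothetical helper (not part of the page below): applies the four
// clean-up steps to an HTML string and returns the result.
function cleanHTML(html)
{
  // 1. Strip everything outside body. [\s\S] is used instead of '.'
  //    so that line breaks in the source are matched too.
  html = html.replace(/^[\s\S]*<body[^>]*>/i, '');
  html = html.replace(/<\/body>[\s\S]*$/i, '');
  // 2. Remove script elements and their content
  html = html.replace(/<script[^>]*>[\s\S]*?<\/script>/ig, '');
  // 3. Remove img tags
  html = html.replace(/<img[^>]*>/ig, '');
  // 4. Rename onload so handlers do not fire on insertion; this also
  //    alters the word 'onload' anywhere in the document text
  html = html.replace(/onload/ig, 'onclick');
  return html;
}
```

Because it returns a new string, the original response text is left
untouched and can be processed again if needed.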
Watch for wrapping, I've tried to avoid it.
<script type="text/javascript">
var HTMLstring = [
'<html><head><title>The title</title></head><body>',
'<script type="text/javascript">function b(){ }<\/script>',
'<p onload="blah">A para<span id="xx"><i><b>Content of </b>',
'xx</i></span> more para</p>',
'<img src="reallyBigImg.jpg" alt="ha ha">',
'<img src="reallyBigImg.jpg" alt="ha ha">',
'<p onload = "blah" id="b">A para<span id="yy">Content <b><i>',
'of</i></b> yy</span> more para</p>',
'<script type="text/javascript">function c(){ }<\/script>',
'</body></html>'].join('');
// Straight RegExp and replace
function getInnerTextRE(id)
{
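// reS matches everything up to and including the opening span tag
// that contains the id; reE matches the first closing span tag
// and everything after it
var reS = new RegExp('.*<span[^>]*\\b' + id + '\\b[^>]*>','i');
var reE = new RegExp('<\/span>.*','i');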
alert( id + ': ' +
HTMLstring.replace(reS,'').replace(reE,'').replace(/<[^>]*>/g,'')
);
}
// RegExp, innerHTML and getElementById
function getInnerText(id)
{
// Work on a local copy so the original string is left intact
var html = HTMLstring;
// Remove everything outside body tags, including the body tags
// ([\s\S] rather than . so that line breaks are matched too)
html = html.replace(/[\s\S]*<body[^>]*>/i,'');
html = html.replace(/<\/body>[\s\S]*/i,'');
// Remove script tags & content (non-greedy, so that script
// content containing < or > is removed too)
html = html.replace(/<script[^>]*>[\s\S]*?<\/script>/ig,'');
// Remove image tags
html = html.replace(/<img[^>]*>/ig,'');
// Replace onload attribute with onclick to stop them executing
html = html.replace(/onload/ig,'onclick');
var d = document.createElement('div');
d.style.display = 'none';
d.innerHTML = html;
document.body.appendChild(d);
alert( id + ': ' + getText(id));
document.body.removeChild(d);
}
function getText(id)
{
var el;
if ( document.getElementById
&& (el = document.getElementById(id))){
if (el.textContent) return el.textContent;
if (el.innerText) return el.innerText;
return el.innerHTML.replace(/<[^>]*>/g,'');
}
}
</script>
<button onclick="getInnerText('xx');getInnerText('yy');">
Get text using RegExp &amp; getElementById</button>
<button onclick="getInnerTextRE('xx');getInnerTextRE('yy');">
Get text using regular expression only</button>
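If you would rather have the regular-expression-only approach return
the text than alert it, something along these lines may do. It is only
a sketch: getSpanText is a hypothetical name, and it assumes the id
attribute is quoted and that the span contains no nested span elements.

```javascript
// Hypothetical helper: returns the text of the span with the given id,
// or null if no such span is found.
function getSpanText(html, id)
{
  // Escape characters in the id that have special meaning in a RegExp
  var safeId = id.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Match the opening span tag carrying id="...", then capture
  // everything up to the first closing span tag. The 'i' flag makes
  // the tag name case-insensitive (and, as a side effect, the id too).
  var re = new RegExp(
    '<span\\b[^>]*\\bid\\s*=\\s*["\']' + safeId + '["\'][^>]*>' +
    '([\\s\\S]*?)<\\/span>', 'i');
  var m = re.exec(html);
  // Strip any tags left inside the span
  return m ? m[1].replace(/<[^>]*>/g, '') : null;
}
```

Unlike the version above it matches on the id attribute itself, so a
class or title that happens to contain the same word is not mistaken
for the id.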