Navigating text string that contains HTML of a page as DOM object?

Alex

Hello.

First, with AJAX I will get a remote web page into a string, so the
string will contain HTML tags and such. I need to extract the inner
text of one <span> whose ID I know.

Is it possible to call something like "string variable".getElementById()
on it somehow?

Thank you.

PS: I am just looking for a proper/efficient way to extract the
information from such a string, and I am open to other ideas. I could
load the page in an IFRAME and get access to the DOM that way, but that
is probably not an elegant solution.

Thank you again.
 
Thomas 'PointedEars' Lahn

Alex said:
First, with AJAX I will get a remote web page into a string, so the
string will contain HTML tags and such. I need to extract the inner
text of one <span> whose ID I know.

Is it possible to call something like "string variable".getElementById()
on it somehow?

Provided that the `span' element contains no other elements:

  var m = x.responseText.match(
    /<span [^>]*\bid="yourID"[^>]*>([^<]+)<\/span>/i);
  if (m)
  {
    var spanText = m[1];
  }
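
If the ID is only known at run time, the same pattern can be built with the RegExp constructor. The following is a minimal sketch, not a definitive implementation: `escapeRegExp` and `getSpanText` are hypothetical helper names, the sample markup is made up, and the sketch keeps the assumptions that the span has no child elements and that the id attribute is double-quoted.

```javascript
// Hypothetical helper: escape characters that are special in a RegExp,
// so that an arbitrary ID can be embedded in the pattern safely.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the text of <span ... id="ID" ...> found in `html`,
// or null if there is no match.
// Assumes the span contains no child elements and id is double-quoted.
function getSpanText(html, id) {
  var re = new RegExp(
    '<span\\s[^>]*\\bid="' + escapeRegExp(id) + '"[^>]*>([^<]+)</span>',
    "i");
  var m = html.match(re);
  return m ? m[1] : null;
}

// Made-up sample input; in the thread's setting this would be x.responseText.
var sample = '<p>Price: <span class="v" id="price.usd">42</span></p>';
console.log(getSpanText(sample, "price.usd")); // 42
console.log(getSpanText(sample, "missing"));   // null
```

Escaping matters here because an ID such as "price.usd" contains a dot, which would otherwise match any character.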

This is the second time you have asked about something that can be
solved with Regular Expressions. Please RTFineM on that:

<URL:http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Reference:Global_Objects:RegExp>
<URL:http://msdn.microsoft.com/library/en-us/jscript7/html/jsobjregexpression.asp>

PS: I am just looking for a proper/efficient way to extract the
information from such a string, and I am open to other ideas. I could
load the page in an IFRAME and get access to the DOM that way, but that
is probably not an elegant solution.

You could also serve XML (text/xml) and use getElementById()
on the Document object XMLHttpRequest::responseXML refers to.
Quickhack:

  var d = x.responseXML;
  if (d)
  {
    var span = d.getElementById("yourID"), spanText;
    if (span)
    {
      if (span.textContent)
      {
        // W3C DOM Level 3 Core
        spanText = span.textContent;
      }
      else if (span.innerText)
      {
        // proprietary
        spanText = span.innerText;
      }
      else if (span.innerHTML)
      {
        // proprietary
        spanText = span.innerHTML;
      }
    }
  }
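
Where textContent is not available, the text can also be collected by walking the child nodes, which needs only basic DOM properties (nodeType, nodeValue, childNodes). A rough sketch under stated assumptions: `collectText` and `getElementText` are hypothetical names, and `fakeDoc` is a hand-made stand-in so that the example runs outside a browser; in a page you would pass x.responseXML instead. Note that in some browsers getElementById() on an XML document may return null unless the parser recognizes the attribute as an ID.

```javascript
// Concatenates the text of a node and all of its descendants.
function collectText(node) {
  var TEXT_NODE = 3, CDATA_SECTION_NODE = 4;
  if (node.nodeType === TEXT_NODE || node.nodeType === CDATA_SECTION_NODE) {
    return node.nodeValue;
  }
  var text = "";
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    text += collectText(children[i]);
  }
  return text;
}

// `doc` is expected to be a Document such as x.responseXML.
// Returns the element's text, or null if doc or the element is missing.
function getElementText(doc, id) {
  if (!doc || !doc.getElementById) return null;
  var el = doc.getElementById(id);
  if (!el) return null;
  // Prefer textContent where it exists; otherwise walk the tree.
  return typeof el.textContent === "string" ? el.textContent
                                            : collectText(el);
}

// Hand-made stand-in for <span id="xx">Content of <b>xx</b></span>
// (hypothetical data, no textContent, to exercise the fallback).
var fakeSpan = {
  nodeType: 1,
  childNodes: [
    { nodeType: 3, nodeValue: "Content of " },
    { nodeType: 1, childNodes: [{ nodeType: 3, nodeValue: "xx" }] }
  ]
};
var fakeDoc = {
  getElementById: function (id) { return id === "xx" ? fakeSpan : null; }
};
console.log(getElementText(fakeDoc, "xx")); // Content of xx
```

Testing with typeof rather than truthiness means an empty element yields "" instead of falling through to the next branch.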


PointedEars
 
RobG

Alex said on 21/03/2006 1:49 PM AEST:
Hello.

First, with AJAX I will get a remote web page into a string, so the
string will contain HTML tags and such. I need to extract the inner
text of one <span> whose ID I know.

Is it possible to call something like "string variable".getElementById()
on it somehow?

The first question is why are you sending more HTML to the client than
is necessary? But supposing you have a good reason for that...

There are two ways: one is to use a regular expression only, the other
is to use a RegExp in concert with innerHTML and getElementById. Which
is best is up to you.

The first method is kinda quick 'n dirty but may suit. The second
method is a bit more general and may be better where you want to access
multiple elements, but it still has its failings. It processes the
HTML, creates a new div element, sets its style.display property to
'none', injects the processed HTML as the div's innerHTML, then uses
getElementById to get the element and its content.

Depending on where the HTML comes from, you may not need every step:

1. Strip everything outside the body - necessary, otherwise the
injected HTML will be invalid.

2. Remove script tags & content - necessary to cut down
on bulk and to stop scripts executing when added to the document
or interfering with other scripts.

3. Remove img tags - no point in downloading images.

4. Replace the onload attribute with onclick to stop script
executing onload - may not be necessary; note that if the text
'onload' appears in the document text, it will be altered too.


Watch for wrapping, I've tried to avoid it.


<script type="text/javascript">

  var HTMLstring = [
    '<html><head><title>The title</title></head><body>',
    '<script type="text/javascript">function b(){ }<\/script>',
    '<p onload="blah">A para<span id="xx"><i><b>Content of </b>',
    'xx</i></span> more para</p>',
    '<img src="reallyBigImg.jpg" alt="ha ha">',
    '<img src="reallyBigImg.jpg" alt="ha ha">',
    '<p onload = "blah" id="b">A para<span id="yy">Content <b><i>',
    'of</i></b> yy</span> more para</p>',
    '<script type="text/javascript">function c(){ }<\/script>',
    '</body></html>'].join('');

  // Straight RegExp and replace
  function getInnerTextRE(id)
  {
    var reS = new RegExp('.*<span[^>]*\\b' + id + '\\b[^>]*>','i');
    var reE = new RegExp('<\/span>.*','i');
    alert( id + ': ' +
      HTMLstring.replace(reS,'').replace(reE,'').replace(/<[^>]*>/g,'')
    );
  }

  // RegExp, innerHTML and getElementById
  function getInnerText(id)
  {
    // Remove everything outside body tags, including the body tags
    HTMLstring = HTMLstring.replace(/.*<body[^>]*>/i,'');
    HTMLstring = HTMLstring.replace(/<\/body>.*/i,'');

    // Remove script tags & content (wrapped for posting)
    HTMLstring =
      HTMLstring.replace(/<script[^>]*>[^<>]*<\/script>/ig,'');

    // Remove image tags
    HTMLstring = HTMLstring.replace(/<img[^>]*>/ig,'');

    // Replace onload attribute with onclick to stop them executing
    HTMLstring = HTMLstring.replace(/onload/g,'onclick');

    // Inject the processed HTML into a hidden div in the document
    var d = document.createElement('div');
    d.style.display = 'none';
    d.innerHTML = HTMLstring;
    document.body.appendChild(d);

    alert( id + ': ' + getText(id));

    // Clean up: take the temporary div out again
    document.body.removeChild(d);
  }

  function getText(id)
  {
    var el;
    if ( document.getElementById
         && (el = document.getElementById(id))){
      if (el.textContent) return el.textContent;
      if (el.innerText) return el.innerText;
      return el.innerHTML.replace(/<[^>]*>/g,'');
    }
  }

</script>

<button onclick="getInnerText('xx');getInnerText('yy');">
Get text using RegExp & getElementById</button>

<button onclick="getInnerTextRE('xx');getInnerTextRE('yy');">
Get text using regular expression only</button>
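
The four preprocessing steps listed above can also be written as one function that returns a cleaned copy instead of modifying the global string. This is only a sketch under assumptions: `prepareFragment` is a hypothetical name and the sample page is made up. It replaces onload only where it is followed by "=", which narrows (but does not eliminate) the chance of altering ordinary document text, and it lets script content contain "<" or ">".

```javascript
// Returns a cleaned copy of `html`; the input string is left untouched.
function prepareFragment(html) {
  return html
    // 1. Keep only what lies between <body ...> and </body>
    .replace(/^[\s\S]*<body[^>]*>/i, "")
    .replace(/<\/body>[\s\S]*$/i, "")
    // 2. Remove script elements together with their content
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "")
    // 3. Remove img tags
    .replace(/<img\b[^>]*>/gi, "")
    // 4. Rename onload only where it looks like an attribute
    //    (i.e. where it is followed by "=")
    .replace(/\bonload(\s*=)/gi, "onclick$1");
}

// Made-up sample page for illustration.
var page = '<html><head><title>t</title></head><body>' +
  '<script type="text/javascript">if (a < b) { c(); }<\/script>' +
  '<p onload = "blah">The onload event<span id="xx">text</span></p>' +
  '<img src="big.jpg" alt="x"></body></html>';

console.log(prepareFragment(page));
// <p onclick = "blah">The onload event<span id="xx">text</span></p>
```

The result can then be assigned to the hidden div's innerHTML as in getInnerText() above.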
 
Alex

I think I will go with responseXML. Regular expressions are hard for me
to debug since I still have not learned them. Plus, I think responseXML
will be a less CPU-intensive approach.
 
