Reading an HTML document & extracting content

Cognizance

Hi gang,

I'm an ASP developer by trade, but I've had to create client side
scripts with JavaScript many times in the past. Simple things, like
validating form elements and such.

Now I've been assigned the task of extracting content from a given HTML
page. If anyone's familiar with the Yahoo! Store order confirmation
screen, I need to be able to grab the total amount from the table to
the right-hand side. (Sample File:
http://www.2beyourself.com/t/sample.html)

If you view the source, the value sits in a table and is surrounded by
ugly HTML. The value I want to retrieve is wrapped in b tags. Originally
I was thinking of using innerHTML or innerText to extract the value, but
we have no control over this part of the Yahoo! Store, so we can't make
that work!

So after talking with peers, we thought of reading in the entire HTML
page and using a regular expression to extract the value.
Something along the lines of: '<b>[0-9]+\.[0-9]{2}<\/b>'

I'm not sure how to accomplish this. Could someone please point me in
the right direction, if this solution is even a good one? If you have
something better, I'm all ears! (eyes) If the regular expression would
be a good solution, I need to find out how to read in the entire HTML
doc and then parse out that piece.

Any tips and suggestions will be greatly appreciated!!

And I hope your week is starting off right. ^^
 
McKirahan

RegEx would be better but this works:

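<html>
<head>
<title>Total.htm</title>
<script type="text/javascript">
function total() {
  var sURL = "http://www.2beyourself.com/t/sample.html";
  try {
    // Use the native object where available; fall back to ActiveX on old IE
    var oXML = window.XMLHttpRequest ? new XMLHttpRequest()
                                     : new ActiveXObject("Microsoft.XMLHTTP");
    oXML.open("GET", sURL, false); // false = synchronous request
    oXML.send(null);
    var sXML = oXML.responseText;
    // Find Total's label
    var iTAG = sXML.indexOf("<b>Total:</b>");
    if (iTAG < 0) { alert("Total not found in " + sURL); return; }
    var sVAL = sXML.substr(iTAG);
    // Find Total's decimal point and keep the two digits after it
    var iDOT = sVAL.indexOf(".");
    sVAL = sVAL.substr(0, iDOT + 3);
    // Find Total's start (just after the last tag before the number)
    iTAG = sVAL.lastIndexOf(">");
    sVAL = sVAL.substr(iTAG + 1);
    // Show Total's value
    alert(sVAL);
  } catch (e) {
    // Thrown when the request fails, e.g. blocked as cross-domain or host unreachable
    alert(sURL + " could not be loaded!");
  }
}
</script>
</head>
<body onload="total()">
</body>
</html>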
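
For comparison, here is a rough sketch of the RegEx approach. It assumes the page looks something like <b>Total:</b> followed a little later by the amount in its own b tags (I'm guessing at the exact markup of the sample page), and extractTotal is just a name I made up. Feed it the responseText from the request above.

```javascript
// Sketch only: assumes markup like <b>Total:</b> ... <b>123.45</b>.
// extractTotal is a hypothetical helper name, not part of any library.
function extractTotal(sHTML) {
  // Label, then anything (lazily), then the first amount wrapped in b tags.
  // Allows an optional "$" and thousands separators in the amount.
  var re = /<b>Total:<\/b>[\s\S]*?<b>\s*\$?([0-9][0-9,]*\.[0-9]{2})\s*<\/b>/i;
  var aMatch = re.exec(sHTML);
  // Return the captured amount as a string, or null if nothing matched
  return aMatch ? aMatch[1] : null;
}

// Example with made-up markup
var sSample = '<tr><td><b>Total:</b></td><td align="right"><b>123.45</b></td></tr>';
var sTotal = extractTotal(sSample);
```

In the page above you would replace the indexOf/substr lines with alert(extractTotal(sXML)); and check for null first. If the real markup differs, only the pattern needs to change.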
 
