Getting data off inconsistent web pages

Dudely

I have a situation where the data I want to pull off a website is
formatted at least 2 different ways that I'm currently aware of:
Sometimes I can pull the text like this:

userBody = IeDoc.getElementById("userbody").innerText

when the data is formatted like this:

<div id="userbody">
Beautiful Luxury Brick Home is West Cobb County. New; 4200 sq feet;
with 4 bedrooms, 3 1/2 BA, Living room, Dining room, Bonus Room, Den
(27x18),

However, there are times when the data is formatted like this:


<div id="userbody">
<div style="float: left; height: 1px; width: 1px; overflow:
hidden">nights Backed fraud acre Contact Making Leases care refrig
Take other Wood wooded closets elegant garden Moen extra</div>
You can Rent to Own any of our homes by tomorrow!<br>


And I don't know how to access it. The data in this case starts with
"You can Rent to Own...". Actually, I might know how to access it: I
could simply look for the next div tag. The problem is that I don't
know WHEN to look for that second div tag. Well, that's not quite
true either. I could in fact check whether the next <div> has a
"footer" class, because that's what shows up next in the first example
data, but it seems like a kludge. I'm hoping there's a cleaner way,
such as checking whether the div's overflow attribute is hidden,
maybe?

Does anyone know what I'm talking about and have a nice solution?

Thank you in advance.
 
David Mark

> I have a situation where the data I want to pull off a website is
> formatted at least 2 different ways that I'm currently aware of:
> Sometimes I can pull the text like this:
>
> userBody = IeDoc.getElementById("userbody").innerText

In IE only, of course. Use the textContent property in standards-based
browsers.

> when the data is formatted like this:
>
> <div id="userbody">
> Beautiful Luxury Brick Home is West Cobb County. New; 4200 sq feet;
> with 4 bedrooms, 3 1/2 BA, Living room, Dining room, Bonus Room, Den
> (27x18),
>
> However, there are times when the data is formatted like this:
>
> <div id="userbody">
> <div style="float: left; height: 1px; width: 1px; overflow:
> hidden">nights Backed fraud acre Contact Making Leases care refrig
> Take other Wood wooded closets elegant garden Moen extra</div>
> You can Rent to Own any of our homes by tomorrow!<br>

Ah, you misspelled "incompetent" in the subject.

> And I don't know how to access it. The data in this case starts with
> "You can Rent to Own..." Actually, I might know how to access it, I
> could simply look for the next div tag. The problem, is that I don't

You mean skip the "hidden" DIV element? Yes, you could iterate
through the child nodes and do whatever you want with them.

> know WHEN to look for that second div tag. Well, that's not quite

The Web "developer" could change the markup structure at any time.

> true either, I could in fact check to see if the next <div> has a
> "footer" class, because that's what shows up next on the first example
> data, but it seems like a kludge. I'm hoping there's a cleaner way -
> such as perhaps being able to check to see if the div's overflow
> attribute is hidden maybe?

There is no such thing as an "overflow attribute." If you meant the
overflow style, that has no bearing here either. You could check to
see if the computed height and width are 1 and the display is
"block" (that's what the float:left nonsense accomplishes), but then
the author may wake up tomorrow and realize the ignorance of that
scheme. You are wasting your time if you have no control of this
markup.
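
The child-node iteration suggested above might look roughly like the following JavaScript sketch. The function name directText is hypothetical, and the sketch assumes the wanted text sits directly inside the "userbody" element rather than inside a nested element, which may stop being true if the page's markup changes.

```javascript
// Hypothetical helper (not from the thread): return only the text that sits
// directly inside a container, ignoring child elements such as the 1px
// "hidden" DIV and the trailing <br>.
function directText(container) {
  var TEXT_NODE = 3; // nodeType value for text nodes
  var children = container.childNodes;
  var parts = [];
  for (var i = 0; i < children.length; i++) {
    if (children[i].nodeType === TEXT_NODE) {
      parts.push(children[i].nodeValue);
    }
  }
  // Collapse whitespace left over from line breaks and indentation in the
  // markup, then trim the single leading/trailing space that may remain.
  return parts.join(" ").replace(/\s+/g, " ").replace(/^ | $/g, "");
}
```

In a browser this would be called as directText(document.getElementById("userbody")). It handles both sample layouts, but any wanted text wrapped in its own element (a <b> or <span>, say) would be skipped.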

[snip]
 
Doug Gunnoe

Use innerHTML instead and then strip the markup out using string
functions: http://www.w3schools.com/jsref/jsref_obj_string.asp

Or test for the second DIV using nextSibling or one of the various
other methods for traversing the DOM (https://developer.mozilla.org/en/DOM/element#Methods),
and maybe also check nodeType.
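
The first suggestion (innerHTML plus string functions) could be sketched as below. The function name stripMarkup is hypothetical, and the sketch assumes that any DIV nested inside "userbody" is filler to be discarded and that such DIVs are not nested inside each other. It also does not decode character entities such as &amp;.

```javascript
// Hypothetical helper (not from the thread): take the innerHTML of the
// "userbody" element and return its visible text.
function stripMarkup(html) {
  return html
    // Assumption: nested DIVs are filler, so drop them with their contents.
    // The pattern is case-insensitive and only handles non-nested DIVs.
    .replace(/<div\b[^>]*>[\s\S]*?<\/\s*div\s*>/gi, " ")
    // Turn line-break tags into spaces so adjacent words do not run together.
    .replace(/<br\b[^>]*>/gi, " ")
    // Drop any remaining tags.
    .replace(/<[^>]+>/g, "")
    // Collapse whitespace, then trim the single leading/trailing space.
    .replace(/\s+/g, " ")
    .replace(/^ | $/g, "");
}
```

In a browser this would be called as stripMarkup(document.getElementById("userbody").innerHTML). Regular expressions are a fragile way to process HTML, so this is only reasonable for small, known fragments like the two samples in this thread.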

Out of curiosity, how are you pulling the data off of the website?
Because there are issues with accessing the DOM of a document from
another domain.
 
Dudely

> use innerHTML instead and then strip the markup out using string
> functions http://www.w3schools.com/jsref/jsref_obj_string.asp

Good idea, I'll have a look and see if innerHTML gives me the info I
need.

> or test for the second DIV using nextSibling or one of the various
> other methods for traversing the DOM https://developer.mozilla.org/en/DOM/element#Methods
> and maybe also check nodeType.

Yeah, I was hoping for a cleaner method. I guess there really is no
clean method when the data isn't clean in the first place.

> Out of curiosity, how are you pulling the data off of the website?
> Because there are issues with accessing the DOM of a document from
> another domain.

That particular piece of data I'm getting exactly as described:
userBody = IeDoc.getElementById("userbody").innerText
I'm using VBA, so there's not much point in going into more detail on
it; I'm sure the group police will get all bent out of shape if I do.
If you want to send me a private email I'd be happy to share
additional details.

Thank you most kindly for your ideas!
 
