Seeking examples of screen scraping....

Jim

I want to extract data from several websites that I visit daily. I'd like
to condense the info into a single web page that I can visit (instead of the
multiple websites I have to visit now to get the same info). There are no
open APIs or webservices for these websites that I am aware of.

I am using VS 2005 and VB.Net. If you could point out some sample code (or
controls to accomplish the same thing), I'd really appreciate it. (C# and
even VS 2003 are OK.)

Thanks!
 
Guest

Jim,
If you intend to get serious about this you are probably going to want to
learn to use a library. Take a look at Simon Mourier's HtmlAgilityPack.
Peter
 
Jim

KJ said:
A google search on the terms ".net screen scrape html" brings up a
great many options.

Gee!! Thanks! I hadn't thought of that.

(Now, for the rest of you with working frontal lobes, I'd still like to see
what you have. Personal recommendations are always better than random
searches.)
 
KJ

You know Jim, I actually thought what I wrote was helpful. And I also
think your sarcasm is out of line.
 
Jim

KJ said:
You know Jim, I actually thought what I wrote was helpful. And I also
think your sarcasm is out of line.

And I think your lazy answer is out of line and sarcastic.

I really get tired of seeing people respond to posts by simply saying
"google it".

If you think the poster is so dense that they don't know how to use search
engines, you should probably skip replying at all as it would do little
good.

Posting a reply like "google it" is a waste of bandwidth and time to those
that view these newsgroups.

Helpful and pertinent posts are welcomed and appreciated. "Google it" is
neither helpful nor pertinent.

How many newsgroup users do you think have not heard of or used Google?

BTW.....your precious Google results only give answers (one of which is
repeated at least 4 times in the first 20 examples - with 2 other repeat
answers accounting for 5 more of the first 20 results) that are very
elementary. The reason for posting the request here is to get more in-depth
answers from the knowledgeable people that frequent the newsgroups.

If I have need of a simplistic, irrelevant result I will most assuredly
"Goggle it".

Jim
 
Nick Malik [Microsoft]

Gabriel Magana

You know, you are just getting help that's worth what you paid for it... If
you disagree with the reply, follow your own advice and skip it, no need to
make frontal lobe comments.
 
Cyril Gupta

Hello Jim,

Those of us who choose to help others on the newsgroup do it not because we
are paid but out of a desire to help fellow coders and maybe because other
coders help us. It's a chain.

Your attitude leaves a lot to be desired. Your question is un-specific,
about a very broad topic, and you have not presented a particular
programming problem. You want an answer that will give you the complete
overview of the solution without making any effort from your side to write
code or evolve a strategy to solve the problem.

Even a very basic search could tell you that you can retrieve the data of a
webpage using the HttpWebRequest object, and from then on it's a question of
logic.
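
A minimal sketch of the retrieval step described above, written in Python for brevity rather than VB.Net (HttpWebRequest plays the same role in .NET). The function name fetch_page, the timeout value, and the UTF-8 fallback are assumptions made for illustration, not something from this thread.

```python
import urllib.request


def fetch_page(url, timeout=10.0):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the response headers, if any.
        # Falling back to UTF-8 is an assumption, not a rule.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of raising, so one bad byte
    # does not abort the whole scrape.
    return raw.decode(charset, errors="replace")
```

This only gets the raw markup; picking the wanted data out of it is a separate step.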

I don't think you should be so rude on the newsgroup to people who care to
answer, or maybe after a while nobody will care to answer.

Regards
Cyril Gupta
 
Jim

Cyril Gupta said:
Hello Jim,

Those of us who choose to help others on the newsgroup do it not because
we are paid but out of a desire to help fellow coders and maybe because
other coders help us. It's a chain.

I've probably answered more cries for help in ngs than you've ever read. I
am familiar with the concept.

Your attitude leaves a lot to be desired. Your question is un-specific,
about a very broad topic, and you have not presented a particular
programming problem.

Really? What exactly would you call "I want to extract data from several
websites that I visit daily. I'd like
to condense the info into a single web page that I can visit (instead of the
multiple websites I have to visit now to get the same info)." ?

Do you think that the exact websites or page info would alter the answer
given? If so, you don't understand the question.

You want an answer that will give you the complete overview of the solution
without making any effort from your side to write code or evolve a strategy
to solve the problem.

Did Sylvia Brown tell you this, or are you a budding psychic yourself?

Either way, you missed with that assumption completely. I was actively
working on the solution before I made the post and continued to do so
afterwards.

But, let's assume (since you evidently like to do that) that your
assumption was right. Programmers, like myself, give away code snippets to
others to save them time and effort and as a tool that they can learn from.
We even have entire sites dedicated to the task.

Ever hear of Planet Source Code or The Code Project or SourceForge? Perhaps
you should log on to those sites and tell the users how lazy they all are.
(PLEASE let me know if you do......I wouldn't miss it for the world!)

What if Microsoft put out the .Net 2.0 framework with your "you try and
figure it out" attitude? You'd just have to figure out how the entire .Net
2.0 framework works. And you'd probably be just as productive as your post
to this thread.

Even a very basic search could tell you that you can retrieve the data of
a webpage using the HttpWebRequest object, and from then on it's a
question of logic.

Well, duh. I acknowledged that Google gives simplistic examples (like the
one you suggest) that get the whole page. What I wanted to know (and if
you'd read the OP, you'd know this) was the most efficient way to extract
data from the page.

I don't think you should be so rude on the newsgroup to people who care to
answer, or maybe after a while nobody will care to answer.

And I don't think that you should appoint yourself the NG-Police. So?
Neither of us cares what the other thinks so why are you wasting even more
bandwidth with your tripe?

If my scolding posters for posting irrelevant, "Google it" posts, or tripe
like you have posted, means that people with no answer (like yourself)
ignore my posts, GREAT! I'm sure others will appreciate your NOT posting
irrelevant material to my threads almost as much as I will.

Have a nice life! And I hope that people post more relevant responses to
your requests than you have to mine.

Jim
 
Jim

See my reply to Cyril Gupta...

Gabriel Magana said:
You know, you are just getting help that's worth what you paid for it...
If you disagree with the reply, follow your own advice and skip it, no
need to make frontal lobe comments.
 
Jim

This is an excellent starting point. Thank you for posting it.

What I am wondering is if there is a way to load the results into an object
that allows one to extract data as if it were a recordset. Have you seen
anything like that?

Jim
 
Jim

Excellent! It not only gets the page, but extracts the text from the page.

But, I am wondering if there is a way to load a "webpage object" and query it
like a recordset. Seen anything like that?

Jim
 
alex_f_il

Look at
SWExplorerAutomation (http://home.comcast.net/~furmana/SWIEAutomation.htm)

SW Explorer Automation (SWEA) creates an object model (automation
interface) for any Web application running in Internet Explorer. The
automation interface consists of pages (scenes), and each page
consists of controls. The following controls are supported:
HtmlContent, HtmlAnchor, HtmlImage, HtmlInputButton, HtmlInputCheckBox,
HtmlInputRadioButton, HtmlInputText, HtmlSelect, HtmlTextArea. The
object model is defined visually in the SWEA designer, which lets you
record scripts (C# and VB) based on the defined application object
model.

It is very easy to create a scraping solution for any Web site using
SWEA.
 
Registered User

Excellent! It not only gets the page, but extracts the text from the page.

But, I am wondering if there is a way to load a "webpage object" and query it
like a recordset. Seen anything like that?

Something like the HTMLDocumentClass type perhaps?
If so, mshtml.dll is the place to look.

regards
A.G.
 
Herfried K. Wagner [MVP]

Jim said:
I want to extract data from several websites that I visit daily. I'd like
to condense the info into a single web page that I can visit (instead of
the multiple websites I have to visit now to get the same info). There are
no open APIs or webservices for these websites that I am aware of.

Parsing an HTML file:

MSHTML Reference
<URL:http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp>

- or -

.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:

<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

- or -

SgmlReader 1.4
<URL:http://www.gotdotnet.com/Community/...mpleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC>

If the file read is in XHTML format, you can use the classes contained in
the 'System.Xml' namespace for reading information from the file.
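
If the page really is well-formed XHTML, the idea above can be sketched with an ordinary XML parser. This example uses Python's xml.etree.ElementTree in place of the .NET System.Xml classes. The sample markup, the table id "scores", and the helper name table_to_records are all invented for illustration. It returns each table row as a dictionary keyed by column header, which is roughly the "recordset" style of access asked about earlier in the thread.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace; "x" is just a local prefix.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page content, made up for this example.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="scores">
      <tr><th>Team</th><th>Goals</th></tr>
      <tr><td>Home</td><td>2</td></tr>
      <tr><td>Away</td><td>1</td></tr>
    </table>
  </body>
</html>"""


def cell_text(cell):
    """Return all text inside a cell, including text of nested tags."""
    return "".join(cell.itertext()).strip()


def table_to_records(xhtml_text, table_id):
    """Return the rows of the table with the given id as a list of dicts."""
    root = ET.fromstring(xhtml_text)
    table = root.find(f".//x:table[@id='{table_id}']", XHTML_NS)
    if table is None:
        return []
    rows = table.findall("x:tr", XHTML_NS)
    if not rows:
        return []
    # Assumes the first row holds the column headers in <th> cells.
    headers = [cell_text(c) for c in rows[0].findall("x:th", XHTML_NS)]
    records = []
    for row in rows[1:]:
        values = [cell_text(c) for c in row.findall("x:td", XHTML_NS)]
        records.append(dict(zip(headers, values)))
    return records


for record in table_to_records(SAMPLE_XHTML, "scores"):
    print(record["Team"], record["Goals"])
```

Pages that are not well-formed, or that use entities such as &nbsp; that the parser does not know, will fail to parse this way, which is where the more tolerant parsers listed above come in.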
 
Rudderius

I'm afraid that what you are asking for is very difficult. The reason I
think so is this: ever heard of the semantic web?

In other words: getting all the text from a webpage is a piece of cake, but
getting a particular part of a webpage is much more difficult, as there
is no fixed point to refer to.

I've read in another post in this thread that you want to use a kind
of query. Well, here is the problem: you want a query like "get results
from soccer_game", but the hard part is defining soccer_game...

The only thing you can do is try to find a fixed point (like the 5th
<p> element, or the <div> element with its id attribute set to "soccer_game").

So, the way I think you should solve your problem is: a. get the page
as an HTML (XML) document; b. define a point (tag) to get the data from.
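
The two steps above can be sketched as follows, using only Python's standard html.parser module. This is an illustration under stated assumptions, not a robust scraper: the sample markup and the names ElementTextById and text_by_id are made up, and the depth counting assumes tags inside the target element are properly closed (apart from void elements such as <br>).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Hypothetical page content, made up for this example.
SAMPLE_HTML = """
<html><body>
<p>Site banner</p>
<div id="soccer_game"><b>Home</b> 2<br>Away 1</div>
<p>Footer</p>
</body></html>
"""


class ElementTextById(HTMLParser):
    """Collect the text inside the first element with a given id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0        # 0 means we are outside the target element
        self.done = False     # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1   # a nested element inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1    # found the fixed point

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html_text, target_id):
    """Return the whitespace-normalized text of the element with target_id."""
    parser = ElementTextById(target_id)
    parser.feed(html_text)
    parser.close()
    # Join the pieces with spaces and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


print(text_by_id(SAMPLE_HTML, "soccer_game"))
```

If the site changes its markup so that the id disappears or moves, the function returns an empty string, so the fixed point has to be re-chosen by hand.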

greetz and success,
Rudderius
 
