how to screen scrape content + images

rachel

Hello,

I am currently contracted by a real estate agent. He
has a page that he created himself with a list of
homes, along with their images and data, in HTML format.

He wants me to take this page and reformat it so that it
looks different.
Do I use screen scraping to do this?
Could someone please point me to a good screen scraping
article? I am using ASP.NET and C#.

Thanks,
Rachel
 
Juan T. Llibre

rachel

Hi Juan,
Thanks for the quick reply.
Are you saying that I can use Teleport Pro with
ASP.NET to get the desired outcome?
I have to use ASP.NET as well because the website has
other functions that it performs.

Thanks for your help, I look forward to your reply.
Rachel
 
Juan T. Llibre

re:
Are you saying that I can use Teleport Pro
with ASP.NET to get the desired outcome?

No, no.

I thought you only wanted to get the pages and images,
so you could reformat the presentation, in order to
later proceed to write the code in ASP.NET.

Teleport Pro allows you to replicate the directory
structure of the site, and then you could write
your ASP.NET application using the same image
directory structure which your client is using currently.

There are some gotchas, such as if your client uses a database
to store the images (I hope not), but essentially,
using a website downloader lets you get the basics.

Hints :

There's a free open-source application called nGallery
http://www.ngallery.org/ which might give you some ideas
about how to handle ASP.NET code for retrieving/displaying
images and manipulating descriptions, etc.

If you're familiar with ASP, maybe it would help you to take
a look at this free Real Estate website code at the Code Project :
http://www.codeproject.com/useritems/real-estate-website.asp

Good luck!


Juan T. Llibre
ASP.NET MVP
===========
 
MWells

Rachel,

If your extraction is a one-time effort, designed to gather the basic
content for your new version of the website, it's easiest to use a tool like
Juan recommended or even just extract the details by hand. Real-estate
listings can be fairly complex, containing a couple of hundred fields per
property listing, so you might consider whipping up some tools for yourself
to pull the data from the page. Regular expressions are very useful for
this purpose.
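
As a rough illustration of the regular-expression approach, here is a minimal sketch in Python (the same pattern syntax should carry over to the Regex class in C# with little change). The markup, class names, and the extract_listings helper are all made up for the example; the real page will look different, and a pattern this simple will misbehave if a listing is missing one of the fields.

```python
import re

# Hypothetical markup: the real page's tags and class names will differ.
SAMPLE_HTML = """
<div class="listing">
  <img src="images/home1.jpg" alt="Front view">
  <span class="address">12 Oak St</span>
  <span class="price">$250,000</span>
</div>
<div class="listing">
  <img src="images/home2.jpg" alt="Front view">
  <span class="address">48 Elm Ave</span>
  <span class="price">$310,500</span>
</div>
"""

# One pattern per listing block; DOTALL lets ".*?" run across line breaks.
LISTING_RE = re.compile(
    r'<div class="listing">.*?'
    r'<img src="(?P<image>[^"]+)".*?'
    r'<span class="address">(?P<address>[^<]+)</span>.*?'
    r'<span class="price">\$(?P<price>[\d,]+)</span>',
    re.DOTALL,
)


def extract_listings(html):
    """Return one dict per listing found in the (assumed) markup."""
    listings = []
    for match in LISTING_RE.finditer(html):
        listings.append({
            "image": match.group("image"),
            "address": match.group("address").strip(),
            # Drop the thousands separators so the price is a plain integer.
            "price": int(match.group("price").replace(",", "")),
        })
    return listings


if __name__ == "__main__":
    for listing in extract_listings(SAMPLE_HTML):
        print(listing)
```

For a one-time extraction this is usually good enough: run it, eyeball the results, and fix the stragglers by hand.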

If your content-extraction need is recurring, I would at all costs avoid
screen scraping. That's akin to using their existing website as a database
for your new site. Among other things, it means they have to keep their old
site running somewhere and in good working order.

Instead, do some digging to find out where the content is originating from.
If they're taking the photographs and entering the content directly into
their website themselves, you'll probably have to mimic that functionality
through a set of web-based administrative tools. In that case you may be
able to skip the listing-content extraction entirely, build the tools, and
have your client re-enter all of the listings. Sell the idea as
"training"... =)

There's a good chance that they are using a third party provider to acquire
the listings, or are feeding the data in directly from their local MLS. In
the US, most multiple listing services (MLSs) now comply with the national
IDX and VOW standards for publishing listings. Assuming your client's MLS
does, you can acquire a developer license and pull the content yourself from
the MLS, store it in a database, and then embed the data in the website as
desired.

We do this for the Chicago region, so I should note that the effort is
fairly significant. The raw data is often published daily in large CSV
files (100 MB+ in size), retrieved from an FTP server. It's fully
de-normalized so you probably want to do a ton of scrubbing and
normalization to make it useful. You'll likely need to decode all of the
fields to English text so that the general public can make sense of the
listing content. Images are also often FTP'd, although some MLSs offer URL
access to the photos for active listings (i.e. you'd have to cache some if
you want to display sold listings for your client). In the VOW ("Virtual
Office Website") program, regulations are such that you also need to have an
enrollment process before visitors are permitted to see the listings, do an
email address verification by sending an account activation email, etc. etc.
etc.
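
To give a feel for the decode-and-store step described above, here is a minimal Python sketch. The column names, the code tables, and the load_feed helper are assumptions made for the example; a real MLS feed has its own layout and far more fields. It reads the feed one row at a time (which matters for 100 MB+ files), translates coded fields to English, and writes the result to a database table.

```python
import csv
import io
import sqlite3

# Hypothetical code tables; a real MLS publishes its own lookup values.
STATUS_CODES = {"A": "Active", "P": "Pending", "S": "Sold"}
TYPE_CODES = {"SF": "Single Family", "CO": "Condo"}

# Hypothetical feed layout, standing in for the daily CSV file.
SAMPLE_FEED = """mls_id,status,type,price,address
1001,A,SF,250000,12 Oak St
1002,S,CO,310500,48 Elm Ave
"""


def load_feed(feed_file, conn):
    """Decode each CSV row and upsert it into the listings table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "mls_id TEXT PRIMARY KEY, status TEXT, type TEXT, "
        "price INTEGER, address TEXT)"
    )
    count = 0
    # DictReader streams the file row by row instead of loading it whole.
    for row in csv.DictReader(feed_file):
        conn.execute(
            "INSERT OR REPLACE INTO listings VALUES (?, ?, ?, ?, ?)",
            (
                row["mls_id"],
                # Fall back to the raw code if it is not in the lookup table.
                STATUS_CODES.get(row["status"], row["status"]),
                TYPE_CODES.get(row["type"], row["type"]),
                int(row["price"]),
                row["address"].strip(),
            ),
        )
        count += 1
    conn.commit()
    return count


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(load_feed(io.StringIO(SAMPLE_FEED), conn), "rows loaded")
```

Keying the table on the listing ID means the daily file can simply be re-run: existing listings are overwritten with the latest values rather than duplicated.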

Nothing insurmountable, but expect to grind some code if you go this route.
Alternately, you may be able to find a third party service to handle the
listing display entirely, and if your client likes the appearance (you
rarely have choices...), then you can just focus on the rest of the website.

/// M
 
