simple (I hope!) screen scraping script

Sylvia

Hello,

I have a task that I think perl would be super for, but it's been so
long (7 years since I touched perl!) that I'm not sure where to start.

Basically, on this page:

http://www.cityofbellevue.org/page.asp?view=38806

There's a set of links to docs. I need to:

1. get the text from each doc file (can perl work with doc files?)

2. then put all the crime reports into one text file, ordered first by
District, and then by Date.

Would this be pretty easy? Any pointers, especially on doing the
opening of the doc files?

thanks much!
Sylvia
 
Paul Lalli

Sylvia said:
I have a task that I think perl would be super for, but it's been so
long (7 years since I touched perl!) that I'm not sure where to start.

Basically, on this page:

http://www.cityofbellevue.org/page.asp?view=38806

There's a set of links to docs. I need to:

1. get the text from each doc file (can perl work with doc files?)

2. then put all the crime reports into one text file, ordered first by
District, and then by Date.

Would this be pretty easy? Any pointers, especially on doing the
opening of the doc files?

For working with Microsoft Word documents:
use the Win32::OLE module
Translate the VB examples in Microsoft's Word object model reference into Perl:
http://msdn.microsoft.com/library/d...us/vbawd11/html/WordVBAWelcome_HV01135786.asp
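
As a rough sketch of what that translation can look like, the snippet below opens one document through Win32::OLE and returns its text. It assumes a Windows machine with Microsoft Word installed. The helper name doc_text is made up for illustration, and the object-model calls (Documents->Open, Content->Text, Close) are the Perl spellings of the corresponding VB members.

```perl
use strict;
use warnings;
use File::Spec;

# Return the plain text of a Word document, or die with the OLE error.
# doc_text() is a hypothetical helper name, not part of Win32::OLE.
# Assumes Windows with Microsoft Word installed.
sub doc_text {
    my ($path) = @_;

    # Loaded at run time so the script still compiles on other systems.
    require Win32::OLE;

    # The second argument names the method to call when $word is destroyed.
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: " . Win32::OLE->LastError() . "\n";
    $word->{Visible} = 0;

    # Word expects an absolute path.
    my $doc = $word->Documents->Open(File::Spec->rel2abs($path))
        or die "Cannot open $path: " . Win32::OLE->LastError() . "\n";

    my $text = $doc->Content->Text;
    $doc->Close(0);    # 0 = do not save changes

    # Word separates paragraphs with carriage returns.
    $text =~ tr/\r/\n/;
    return $text;
}

# Usage, only attempted on Windows when a file name is given:
if ($^O eq 'MSWin32' && @ARGV) {
    print doc_text($ARGV[0]);
}
```

Starting Word once and reusing the same application object for every file would likely be faster than the one-shot version shown here.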

For retrieving a webpage:
use the LWP::Simple module

For working with links on a website:
use the HTML::LinkExtractor module
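
A minimal sketch of the fetch-and-extract step follows. Note that it uses HTML::LinkExtor (part of the HTML::Parser distribution), a related module, rather than the one named above, because only the URLs are needed here and not the link text. The helper name doc_links is hypothetical, and the assumption that the report links end in ".doc" may not hold for the real page.

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::Simple qw(get);

# Return the absolute URL of every <a href> that ends in ".doc".
# doc_links() is a hypothetical helper; $base resolves relative links.
sub doc_links {
    my ($html, $base) = @_;

    # With no callback, the parser collects links for a later ->links call.
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;

    my @found;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};
        # Stringify, since the hrefs are URI objects when a base is given.
        push @found, "$attr{href}" if $attr{href} =~ /\.doc$/i;
    }
    return @found;
}

# Usage (needs network access): pass the page URL on the command line.
if (@ARGV) {
    my $url  = $ARGV[0];
    my $html = get($url);
    defined $html or die "Could not fetch $url\n";
    print "$_\n" for doc_links($html, $url);
}
```

Each URL found this way could then be saved locally with LWP::Simple's getstore before opening it in Word.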

All of the above modules can be found at http://search.cpan.org
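
For the ordering part of the question (District first, then Date), a two-level sort comparison is probably all that is needed once the text has been parsed into records. The sketch below uses made-up records. The field names and the YYYY-MM-DD date format are assumptions, since the layout of the actual reports is not known here.

```perl
use strict;
use warnings;

# Hypothetical records; the real field names and date format depend on
# how the reports are parsed. Dates are assumed to be YYYY-MM-DD so that
# plain string comparison matches chronological order.
my @reports = (
    { district => 'North', date => '2006-03-14', text => 'report C' },
    { district => 'East',  date => '2006-03-15', text => 'report B' },
    { district => 'East',  date => '2006-03-02', text => 'report A' },
);

# Compare by district first; the date only decides ties.
sub by_district_then_date {
    return $a->{district} cmp $b->{district}
        || $a->{date}     cmp $b->{date};
}

my @sorted = sort by_district_then_date @reports;

# Print one line per report; redirect the output to get a single text file.
for my $r (@sorted) {
    print join("\t", $r->{district}, $r->{date}, $r->{text}), "\n";
}
```

If the dates turn out to be in another format, such as MM/DD/YYYY, they would need to be converted to a sortable form before comparing.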

Hope this helps you get started,
Paul Lalli
 
Paul Lalli

John said:
Is that very different from HTML::LinkExtor?

Can't say for certain, as I've never used that one. However, from
LinkExtractor's docs:

HTML::LinkExtractor is used for extracting links from HTML. It is very
similar to HTML::LinkExtor, except that besides getting the URL, you
also get the link-text.

Paul Lalli
 
John Bokma

Paul Lalli said:
Can't say for certain, as I've never used that one. However, from
LinkExtractor's docs:

HTML::LinkExtractor is used for extracting links from HTML. It is very
similar to HTML::LinkExtor, except that besides getting the URL, you
also get the link-text.

Thanks Paul, I was just wondering. I have used HTML::LinkExtor recently and
it did what I wanted. But I'll remember HTML::LinkExtractor for when I need
the text as well.
 
