Help extracting info from HTML source ..

s. d. rose

Hello All.
I am learning Python, and have never worked with HTML. However, I would
like to write a simple script to audit my 100+ Netware servers via their web
portal.

I was reading Chapter 8 of Dive into Python, which deals with this topic.
In the web portal of the server, there is a section similar to this:

--> clients and <A
href="http://eugenia.blogsome.com/?s=ipkall">clever</a> services. <--

which I took from Slashdot. What I'm talking about is the link text
('clever' in this sample), which is what represents the link to
eugenia.blogsome.com on the page.

What I'd like to do is save the two pieces of info relative to the server
name. Probably in a dictionary, such as server1['link'] for the page on
eugenia.blogsome.com and server1['description'] for the link text, 'clever'.

I've used the example from Dive into Python to get the actual link in the
source of the HTML, but I don't know how to get the text that is the
hyperlink.

So in the portal, I've got a link 'Scheduled Server Reboot' going to, say,
/ScheduledTasks/ID000000003/ on Server1, using HTML source similar to the
clipped sample above.

Can someone please help me? Sure, I could manually go to each server, but I
wouldn't learn anything. I've learned some, but also have real deadlines,
so I eagerly hope for any assistance & instruction.

Thank you!
-Dave
Shelton, CT
 
Miki

Hello Shelton,
s. d. rose said:
"I am learning Python, and have never worked with HTML. However, I would
like to write a simple script to audit my 100+ Netware servers via their web
portal."

Always use the right tool. BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/) is best for web
scraping (IMO).

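# Python 2 with BeautifulSoup 3, as originally posted
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

html = urlopen("http://www.python.org").read()
soup = BeautifulSoup(html)
# Print each link's target and the contents of the <a> tag
for link in soup("a"):
    print link["href"], "-->", link.contents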

HTH,
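
For the original question (getting the text of the hyperlink as well as its target), here is a minimal sketch that uses only the Python 3 standard library's html.parser instead of BeautifulSoup. The names LinkCollector and extract_links are made up for this illustration, and the sample string is the snippet quoted in the first post. It assumes links are not nested and that only anchors with an href matter.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <A ...> is reported as "a".
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a href=...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = ('clients and <A href="http://eugenia.blogsome.com/?s=ipkall">'
          'clever</a> services.')
print(extract_links(sample))
```

Each tuple holds the link target first and the visible link text second, which are the two pieces of information asked about.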
 
Nikita the Spider

Miki said:
"Always use the right tool. BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/) is best for web
scraping (IMO)."

Agreed. HTML scraping gets really complicated once you get into it. It
might be interesting to write such a library just for your own
satisfaction, but if you want to get something done, then use a module
that is already written, like BeautifulSoup. Another module that will do
the same job but works differently (and more simply, IMO) is HTMLData by
Connelly Barnes:
http://oregonstate.edu/~barnesc/htmldata/
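
Whichever module does the parsing, the results still have to be stored per server, as the original post asks. The following is a rough sketch of one way to do that in Python 3. It assumes the (href, text) pairs have already been extracted, and it resolves relative targets such as /ScheduledTasks/ID000000003/ against the server's portal address with urllib.parse.urljoin. The function name record_links and the address http://server1.example.com:8008/ are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin


def record_links(servers, server_name, base_url, links):
    """Store links for one server as {description: absolute URL}.

    servers     -- dict of all servers, updated in place and returned
    server_name -- key for this server, e.g. "server1"
    base_url    -- portal address used to resolve relative hrefs
    links       -- iterable of (href, text) pairs from the parsed page
    """
    entry = servers.setdefault(server_name, {})
    for href, text in links:
        # urljoin leaves absolute URLs alone and resolves relative ones.
        entry[text] = urljoin(base_url, href)
    return servers


servers = {}
record_links(
    servers,
    "server1",
    "http://server1.example.com:8008/",  # hypothetical portal address
    [("/ScheduledTasks/ID000000003/", "Scheduled Server Reboot")],
)
print(servers["server1"]["Scheduled Server Reboot"])
```

Keying the inner dictionary by the link text is a design choice made here because a portal page usually has many links. A single 'link' and 'description' key per server, as sketched in the question, would only hold one of them.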
 
