Help extracting info from HTML source ..

s. d. rose · Jan 26, 2007

Hello All.
I am learning Python, and have never worked with HTML. However, I would
like to write a simple script to audit my 100+ Netware servers via their web
portal.

I was reading Chapter 8 of Dive into Python, which deals with this topic.
In the web portal of the server, there is a section similar to this:

--> clients and <A
href="http://eugenia.blogsome.com/?s=ipkall">clever</a> services. <--

which I took from SlashDot, but what I'm talking about is using the word
'services' to represent the link to eugenia.blogsome.com.

What I'd like to do is save the two pieces of info relative to the server
name. Probably in a dictionary, such as server1[link] to the page on
eugenia.blogsome.com and server1[description] to 'services'.

I've used the example from Dive into Python to get the actual link in the
source of the HTML, but I don't know how to get the text that is the
hyperlink.

So in the portal, I've got a link 'Scheduled Server Reboot' going to, say,
/ScheduledTasks/ID000000003/ on Server1, with HTML source similar to the
snippet clipped above.

Can someone please help me? Sure, I could manually go to each server, but I
wouldn't learn anything. I've learned some, but also have real deadlines,
so I eagerly hope for any assistance & instruction.

Thank you!
-Dave
Shelton, CT

Miki · Jan 26, 2007

Hello Shelton,

s. d. rose said:
I am learning Python, and have never worked with HTML. However, I would
like to write a simple script to audit my 100+ Netware servers via their web
portal.

Always use the right tool. BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/) is best for web
scraping (IMO).

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

html = urlopen("http://www.python.org").read()
soup = BeautifulSoup(html)
# soup("a") finds every <a> tag; link.contents holds the link's text
for link in soup("a"):
    print link["href"], "-->", link.contents
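
To tie this back to the dictionary idea in the question, here is a rough Python 3 sketch that uses only the standard library's html.parser instead of BeautifulSoup (a swap, so nothing needs installing). The names LinkCollector and audit_links, the host server1.example.com, and the exact markup of the 'Scheduled Server Reboot' link are assumptions for illustration; the real portal's HTML may well differ. Note that in the Slashdot snippet the text inside the <a> tag is 'clever'.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <A HREF=...> works.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def audit_links(base_url, html):
    """Return a list of {'link': ..., 'description': ...} dicts for one page."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        {"link": urljoin(base_url, href), "description": text}
        for href, text in parser.links
    ]


# Snippet from the question, plus an assumed portal link for illustration.
sample = '''clients and <A
href="http://eugenia.blogsome.com/?s=ipkall">clever</a> services.
<a href="/ScheduledTasks/ID000000003/">Scheduled Server Reboot</a>'''

# Hypothetical host name; one entry per server, keyed by server name.
servers = {"server1": audit_links("http://server1.example.com/", sample)}
for entry in servers["server1"]:
    print(entry["link"], "-->", entry["description"])
```

For a real audit, the sample string would be replaced by the page fetched from each server, and urljoin turns relative links such as /ScheduledTasks/... into full URLs for that server.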

HTH,

Nikita the Spider · Jan 26, 2007

Agreed. HTML scraping is really complicated once you get into it. It
might be interesting to write such a library just for your own
satisfaction, but if you want to get something done then use a module
that's already written, like BeautifulSoup. Another module that will do
the same job but works differently (and more simply, IMO) is HTMLData by
Connelly Barnes:
http://oregonstate.edu/~barnesc/htmldata/
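
As a small illustration of why hand-rolled scraping gets complicated: the snippet from the original question has an uppercase <A and a line break before href, which is enough to defeat a simple regular expression. This is only a sketch in Python 3 with the standard library; the regex and the HrefFinder class are made up for the comparison and are not part of BeautifulSoup or HTMLData.

```python
import re
from html.parser import HTMLParser

snippet = '''clients and <A
href="http://eugenia.blogsome.com/?s=ipkall">clever</a> services.'''

# Naive pattern: assumes a lowercase tag, a single space, and one line.
naive = re.findall(r'<a href="(.*?)">(.*?)</a>', snippet)


class HrefFinder(HTMLParser):
    """Record the href of every <a> tag, however it is spelled or wrapped."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from the parser.
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


finder = HrefFinder()
finder.feed(snippet)
finder.close()

print("regex: ", naive)         # the naive pattern misses the link
print("parser:", finder.hrefs)  # the parser finds it
```

The regex could be patched for this one case, but every new page tends to bring another variation, which is the argument for using a module that's already written.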
