need start point for getting html info from web

nephish

hey there,

i have a small app that needs to get information from a few tables on
different websites. i have looked at urllib and httplib. the sites i
need to get data from mostly present it in tables, so i think that
would make it easier. can anyone suggest a good starting point for
learning how to do this, or point me to a good how-to?
thanks,
sk
 
Mike Meyer

Don't have a link to a howto. But you're halfway there. urllib (and
urllib2) will get HTML text from the websites. Pulling data from it
sort of depends on the nature of the HTML. If it's well-structured
XHTML, you can use your favorite XML library. If it's well-structured
HTML, you can try htmllib, but it's pretty primitive. If it's not
well-structured, you can use BeautifulSoup. I've used it to pull data
from tables. The problem with any of this is that your code really
depends on the structure - or lack thereof - of the HTML you're
scraping. If they change it, your code breaks.
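
As a rough sketch of the fetch-then-parse approach described above, assuming Python 3 (where urllib2 has been folded into urllib.request and htmllib no longer exists, so the standard library's html.parser stands in for it here). The names fetch_html, TableParser and extract_tables, the SAMPLE markup and its example.org hostnames are all made up for illustration. It assumes reasonably well-formed tables with closed cell tags and no nesting; for messier pages BeautifulSoup is likely the better fit.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url, encoding="utf-8"):
    """Download a page and return its text (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table> seen
        self._row = None   # cells of the <tr> currently open, if any
        self._cell = None  # text fragments of the <td>/<th> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample page standing in for a real site.
SAMPLE = """
<table>
  <tr><th>Server</th><th>Location</th></tr>
  <tr><td>time-a.example.org</td><td>Maryland</td></tr>
  <tr><td>time-b.example.org</td><td>Colorado</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_tables(SAMPLE)[0]:
        print(row)
```

Calling extract_tables(fetch_html(url)) on a real page should give one list of rows per table. Which table and which columns hold the values you want still depends on the page layout, so that part has to be written per site.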

<mike
 
nephish

yeah, i know i am going to have to write a bunch of stuff because the
values i want to get come from several different sites. ah-well, just
wanting to know the easiest way to learn how to get started. i will
check into beautiful soup, i think i have heard it referred to before.
thanks
shawn
 
Paul McGuire

pyparsing comes with a simple HTML scraper example for extracting the NIST
NTP servers from an HTML table. pyparsing is also fairly tolerant of
"unclean" HTML. Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul
 
alex_f_il

You can easily do it with SW Explorer Automation
(http://home.comcast.net/~furmana/SWIEAutomation.htm).
The program creates an automation API for any Web application which
uses HTML and DHTML and works with Microsoft Internet Explorer. The Web
application becomes programmatically accessible from any .NET language.


The tool has a Visual Table Data Extractor. It lets you visually define
the table structure. The table becomes accessible from code as a
DataTable class. You can develop the extraction script in hours with
the tool.
 
