Accessing the databases of other web sites


Jenny

I am doing research about the relationship between sales rates and
discounted prices or recommendation frequency. To do this, I need to
access the databases of commercial web sites via the Internet. I think
this is possible because it is similar to the work of price comparison
sites and web robots.

I am studying Python these days because I think it is a good language
for the work. Actually, I am a novice at Python.

I welcome any information about this problem. Thanks in advance.
 

John J. Lee

I am doing research about the relationship between sales rates and
discounted prices or recommendation frequency. To do this, I need to
access the databases of commercial web sites via the Internet. I think
this is possible because it is similar to the work of price comparison
sites and web robots.

IIUYC, what you're contemplating is called "web scraping" -- at least,
it is by Cameron Laird, and I like the name. Others might know it as
"web client programming". Cameron wrote an article about this a while
back (Unix Review?) which you might like if you're a newbie -- Google
for it (but note that the Perl book he mentions has actually been
replaced by a newer one by Sean Burke, also from O'Reilly).

I am studying Python these days because I think it is a good language
for the work.
[...]

I think so too.

I welcome any information about this problem. Thanks in advance.

In the standard library, you'll want to look at these modules: httplib
(low level HTTP -- you probably don't want to use this), urllib2
(opens URLs as if they were files, handles redirections, proxies
etc. for you) and HTMLParser. The standard library also includes
sgmllib & htmllib, but you'll probably want to use HTMLParser instead
if you want that kind of event-driven parsing at all. Regular
expressions (re module) can also come in handy.
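To make the event-driven style concrete, here is a minimal sketch using only the standard library. The HTML snippet and the price pattern are invented for illustration; a real page would of course be fetched with urllib2 (renamed urllib.request in later Pythons) rather than inlined.

```python
# Event-driven parsing with the stdlib HTMLParser, plus a regex for
# pulling out simple patterns such as prices.
try:
    from HTMLParser import HTMLParser   # Python 2, as in this thread
except ImportError:
    from html.parser import HTMLParser  # module renamed in Python 3
import re

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<p>Sale! <a href="/item/42">Widget</a> now $19.99</p>'

parser = LinkExtractor()
parser.feed(html)
print(parser.links)                      # ['/item/42']

# The re module handles the non-structural bits.
prices = re.findall(r"\$(\d+\.\d{2})", html)
print(prices)                            # ['19.99']
```

The parser gets a callback per tag, which is exactly the "event-driven" shape described above: fine for extracting one kind of thing, more work once the logic depends on nesting.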

Personally, I've decided that I prefer the DOM style of parsing for
anything complicated -- it's just less work than the event-driven
style (though I don't much like the DOM API). PyXML has an HTML DOM
implementation called 4DOM. Use that together with mxTidy or
uTidylib: they will clean up the horrid HTML you'll find on the web to
the point where 4DOM can make sense of it. Another option is to use
mxTidy/uTidylib to output XHTML, which allows you to use any XML DOM
implementation -- eg. pxdom, minidom, libxml...
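A sketch of the DOM style, under the assumption that mxTidy/uTidylib has already turned the page into clean XHTML; with that assumption the stdlib's minidom is enough to walk the tree (the table here is invented).

```python
# DOM-style parsing: tidy first, then treat the page as XML.
from xml.dom import minidom

# Stand-in for mxTidy/uTidylib output -- already well-formed XHTML.
xhtml = """<html><body>
<table>
  <tr><td>Widget</td><td>19.99</td></tr>
  <tr><td>Gadget</td><td>4.50</td></tr>
</table>
</body></html>"""

doc = minidom.parseString(xhtml)
rows = []
for tr in doc.getElementsByTagName("tr"):
    cells = [td.firstChild.data for td in tr.getElementsByTagName("td")]
    rows.append(cells)

print(rows)   # [['Widget', '19.99'], ['Gadget', '4.50']]
```

Compared with the event-driven sketch, there is no callback bookkeeping: the whole tree is in memory and you just query it.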

You might find my modules useful too. ClientCookie has an interface
just like urllib2 (and uses it to do its work), but handles cookies
and some other stuff too. ClientForm makes it easier to work with
HTML forms. ClientTable is currently a heap of junk, don't use it ;-)
I've just rewritten ClientForm on top of the DOM, which lets you
switch back and forth between the two APIs (and also lets you handle
JavaScript, rather badly ATM) -- coming RSN...

http://wwwsearch.sourceforge.net/
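For reference, the cookie-handling idea in ClientCookie later made it into the standard library as cookielib (http.cookiejar in Python 3), with the same urllib2-style interface; a minimal sketch using the stdlib names, since ClientCookie itself may not be installed:

```python
# Transparent cookie handling in the urllib2 style ClientCookie uses.
try:
    import cookielib                     # Python 2 stdlib descendant
    import urllib2 as request
except ImportError:
    import http.cookiejar as cookielib   # renamed in Python 3
    import urllib.request as request

jar = cookielib.CookieJar()
# An opener built this way sends and stores cookies automatically,
# much as ClientCookie.urlopen does.
opener = request.build_opener(request.HTTPCookieProcessor(jar))

print(len(jar))   # 0 -- the jar stays empty until a request is made
```

Every `opener.open(url)` call after this would round-trip cookies for you, which is the point: session-based commercial sites are hard to scrape without it.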


The other, completely different, way of web scraping is to use the
"automation" capabilities of the various big web browsers: Microsoft
Internet Explorer, KDE's Konqueror and Mozilla are all scriptable from
Python. You need the Python for Windows extensions, PyKDE or PyXPCOM
respectively to control those browsers. Advantages: easy handling of
JavaScript and other assorted nonsense, and they're generally
reasonably well-tested and stable pieces of software (not to mention
de-facto standards). Disadvantages: poor portability in some cases,
and they're rather big, complicated, closed applications that are hard
to modify (compared to the pure Python approach) and to distribute
(which last, I guess, isn't a problem for you, since you'll be the
only one using your software). Other problems: COM (for MSIE) is a
bit of a headache for newbies, PyXPCOM last time I looked seemed a
pain to install (Brendan Eich mentioned in a newsgroup post that that
has been changing recently, though), and PyKDE might not be that well
tested (it's a very big wrapper!).
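To give a flavour of the COM route for MSIE: a sketch assuming the Python for Windows extensions (pywin32) are installed, with the import kept inside the function so the file still loads on other platforms. The URL in the usage note is a placeholder.

```python
# Driving Internet Explorer via COM to get the *rendered* page,
# JavaScript and all.  Windows-only; requires pywin32.
def fetch_with_ie(url):
    """Load a page in Internet Explorer and return the rendered HTML."""
    import time
    import win32com.client
    ie = win32com.client.Dispatch("InternetExplorer.Application")
    ie.Visible = 0
    ie.Navigate(url)
    while ie.Busy:            # wait for the page (and its scripts)
        time.sleep(0.1)
    html = ie.Document.documentElement.outerHTML
    ie.Quit()
    return html

# Usage (on Windows only):
#   html = fetch_with_ie("http://example.com/")
```

The advantage over the pure-Python approach is visible in the one line reading `ie.Document`: that is the browser's post-JavaScript DOM, not the raw bytes off the wire.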

One other bunch of software worthy of mention: you can use Jython to
access various Java libraries. HTTPClient and httpunit look like they
might be useful. In particular, the latter has some JavaScript
support.


John
 

John J. Lee

[...]

Forgot to say: if you don't already know, Google Groups can be worth
its weight in round tuits. Try some searches there, in
comp.lang.python, on the stuff I mentioned.


John
 

Cousin Stanley

| IIUYC, what you're contemplating is called "web scraping"
| ....

John ....

I did a bit of web scraping over the past weekend
for a friend who is interested in Lotto numbers ....

The Lotto numbers were readily available on the web
and presented as well-formed and readable HTML tables ....

The primary problem I found up front was being able
to parse and transform this data into something
that Python, or any other language, might be able
to cope with for subsequent analysis ....

Since the number of records that I was dealing with
in this case was relatively small, only a couple of thousand,
I could manage the initial data transformations
using my genetically encoded EyeBall parser,
a text editor, and a couple of one-off Python scripts ....

The first step in each case for the source files
was using HTML Tidy to ...

"clean up the horrid HTML you'll find on the web "

I'd like to emphasize for the benefit of the original poster
that the initial data parsing will probably entail a fair amount
of non-trivial work, and that the subsequent data analysis
and reporting will seem almost trivial by comparison ....
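For the original poster, a minimal sketch of the kind of one-off script Stanley describes: after HTML Tidy has cleaned the page, pull each table row's cells out and convert them to numbers. The table layout here is invented, standing in for the well-formed Lotto tables.

```python
# One-off extraction from a tidied, well-formed HTML table.
import re

# Stand-in for HTML Tidy output.
tidied = """<table>
<tr><td>2003-06-01</td><td>5</td><td>12</td><td>23</td></tr>
<tr><td>2003-06-04</td><td>7</td><td>19</td><td>41</td></tr>
</table>"""

draws = []
for row in re.findall(r"<tr>(.*?)</tr>", tidied, re.S):
    cells = re.findall(r"<td>(.*?)</td>", row, re.S)
    # First cell is the draw date, the rest are the numbers.
    draws.append((cells[0], [int(n) for n in cells[1:]]))

print(draws)
# [('2003-06-01', [5, 12, 23]), ('2003-06-04', [7, 19, 41])]
```

Regexes like these only hold up because Tidy has already normalized the markup, which is exactly Stanley's point: the cleanup and parsing step is where the real work is.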

Thanks for posting the info regarding different approaches,
as I think it will be useful for me when I get around
to replacing my EyeBall parser with something more effective ....
 

Cameron Laird

[email protected] (Jenny) said:
I am doing research about the relationship between sales rates and
discounted prices or recommendation frequency. To do this, I need to
access the databases of commercial web sites via the Internet. I think
this is possible because it is similar to the work of price comparison
sites and web robots.

IIUYC, what you're contemplating is called "web scraping" -- at least,
it is by Cameron Laird, and I like the name. Others might know it as
"web client programming". Cameron wrote an article about this a while
back (Unix Review?) which you might like if you're a newbie -- Google
for it (but note that the Perl book he mentions has actually been
replaced by a newer one by Sean Burke, also from O'Reilly).

I am studying Python these days because I think it is a good language
for the work.
[...]

I think so too.
[excellent and detailed technical advice]
Also filling a niche in this territory is PyCurl <URL: http://pycurl.sf.net >.
The references at <URL: http://wiki.tcl.tk/WebScraping > are likely to be at
least inspirational.

I'm ... reserved about the prospects for the proposed research. The commercial
sites you want to study are, in my experience, some of the most difficult to
"scrape". Complementing that difficulty is the poverty of inference I
anticipate you'll be able to ground on what you find there; their commerce has
a lot more noise than signal, as I see it. 'Twould be great, though, for you
to uncover something real. Good luck.
 

John J. Lee

(e-mail address removed) (Cameron Laird) writes:
[...]
I'm ... reserved about the prospects for the proposed research. The commercial
sites you want to study are, in my experience, some of the most difficult to
"scrape".

Which (ATM, anyway) is a good reason for doing it with browser automation.

Complementing that difficulty is the poverty of inference I anticipate
you'll be able to ground on what you find there; their commerce has a lot
more noise than signal, as I see it.

What do you mean 'their commerce has more noise than signal'?

'Twould be great, though, for you to
uncover something real. Good luck.

What I was wondering was where the sales data are going to come from.


John
 

Cameron Laird

[...]
What do you mean 'their commerce has more noise than signal'?



What I was wondering was where the sales data are going to come from.
[...]
That's a typical part. As I understand Jenny, she's going
to look at, say, eBay, and correlate "sales" with "price"
and "marketing" variables. I apologize for being obscure
in abbreviating my judgment that that approach is likely to
yield "more noise than signal"; you're quite right for
asking what I mean. What I mean by that is that all the
variables strike me as poorly replicable, in at least three
respects:
A. eBay and other operators have an interest
in releasing data only as they support
their own success, and not for their
analytic clarity. Their incentives to
categorize and aggregate variables can do
no more than to leave the underlying
relations unbiased, and that's plenty
unlikely.
B. I suspect the universes are so small as
to provide little inferential power. I'm
most tentative about this one. I know
eBay is big business, but I suspect that
looking at any other operation will yield
only data from an exceptional period,
because the businesses are *not* sustainable.
C. Measurements of "marketing effort" and
"promotion intensity" and other such quali-
tative notions ... well, it sounds ambitious
to me.
 

John J. Lee

[...]
That's a typical part. As I understand Jenny, she's going
to look at, say, eBay, and correlate "sales" with "price"
and "marketing" variables.


Oh, eBay, I see. I was thinking about non-auction sites. On auction
sites, some of the sales data are public, I suppose.


John
 
