Access to database other web sites

Discussion in 'Python' started by Jenny, Sep 26, 2003.

  1. Jenny

    Jenny Guest

    I am doing research on the relationship between sales rates and
    discounted prices or recommendation frequency. To do this, I need to
    access the databases of commercial web sites via the internet. I think
    this is possible because it is similar to the work of price comparison
    sites and web robots.

    I am studying Python these days because I think it is a good language
    for the work. Actually, I am a novice at Python.

    I welcome any information about this problem. Thanks in advance.
    Jenny, Sep 26, 2003
    #1

  2. John J. Lee

    John J. Lee Guest

    (Jenny) writes:

    > I am doing research on the relationship between sales rates and
    > discounted prices or recommendation frequency. To do this, I need to
    > access the databases of commercial web sites via the internet. I think
    > this is possible because it is similar to the work of price comparison
    > sites and web robots.


    IIUYC, what you're contemplating is called "web scraping" -- at least,
    it is by Cameron Laird, and I like the name. Others might know it as
    "web client programming". Cameron wrote an article about this a while
    back (Unix Review?) which you might like if you're a newbie -- Google
    for it (but note that the Perl book he mentions has actually been
    replaced by a newer one by Sean Burke, also from O'Reilly).


    > I am studying Python these days because I think it is a good language
    > for the work.

    [...]

    I think so too.


    > I welcome any information about this problem. Thanks in advance.


    In the standard library, you'll want to look at these modules: httplib
    (low level HTTP -- you probably don't want to use this), urllib2
    (opens URLs as if they were files, handles redirections, proxies
    etc. for you) and HTMLParser. The standard library also includes
    sgmllib & htmllib, but you'll probably want to use HTMLParser instead
    if you want that kind of event-driven parsing at all. Regular
    expressions (re module) can also come in handy.
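
A minimal sketch of the event-driven style described above, assuming Python 3, where the modules are named `html.parser` and `urllib.request` rather than `HTMLParser` and `urllib2`. The page fragment, the `price` class name, and the helper names are all made up for illustration; a real script would fetch the page with `urllib.request.urlopen` instead of using a literal string.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded product page.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def extract_prices(html):
    """Event-driven extraction: feed the markup, read back what was collected."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


# The regular-expression route: quicker to write, but more fragile.
PRICE_RE = re.compile(r"\$(\d+\.\d{2})")

prices = extract_prices(SAMPLE_HTML)
amounts = PRICE_RE.findall(SAMPLE_HTML)
```

The parser version keys on the markup structure, so it keeps working if a dollar amount shows up elsewhere on the page; the regex version matches any such amount wherever it appears.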

    Personally, I've decided that I prefer the DOM style of parsing for
    anything complicated -- it's just less work than the event-driven
    style (though I don't much like the DOM API). PyXML has an HTML DOM
    implementation called 4DOM. Use that together with mxTidy or
    uTidylib: they will clean up the horrid HTML you'll find on the web to
    the point where 4DOM can make sense of it. Another option is to use
    mxTidy/uTidylib to output XHTML, which allows you to use any XML DOM
    implementation -- eg. pxdom, minidom, libxml...
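
The DOM route can be sketched with `xml.dom.minidom` from the standard library, assuming the page has already been cleaned into well-formed XHTML by a Tidy-style tool. The markup and the helper names below are hypothetical, and the cleaning step itself is not shown.

```python
from xml.dom import minidom

# Hypothetical XHTML, as a Tidy-style cleaner might emit it (well-formed XML).
SAMPLE_XHTML = """<html><body>
<table id="offers">
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
</body></html>"""


def cell_text(node):
    """Concatenate the text nodes directly under a DOM element."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def table_rows(xhtml):
    """Return each <tr> as a list of its <td> texts."""
    doc = minidom.parseString(xhtml)
    rows = []
    for tr in doc.getElementsByTagName("tr"):
        rows.append([cell_text(td) for td in tr.getElementsByTagName("td")])
    return rows


rows = table_rows(SAMPLE_XHTML)
```

Compared with the event-driven style, there is no state to track by hand: the whole tree is in memory and can be queried in any order. The price is that the input must be well-formed, which is why the cleaning step matters.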

    You might find my modules useful too. ClientCookie has an interface
    just like urllib2 (and uses it to do its work), but handles cookies
    and some other stuff too. ClientForm makes it easier to work with
    HTML forms. ClientTable is currently a heap of junk, don't use it ;-)
    I've just rewritten ClientForm on top of the DOM, which lets you
    switch back and forth between the two APIs (and also lets you handle
    JavaScript, rather badly ATM) -- coming RSN...

    http://wwwsearch.sourceforge.net/
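
For readers on Python 3, a cookie-aware opener in the same spirit can be sketched with the standard library's `http.cookiejar` and `urllib.request`. This is not ClientCookie's own API; the function name, the User-Agent string, and the URL in the comments are assumptions for illustration.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Some sites reject the default Python User-Agent; this value is illustrative.
    opener.addheaders = [("User-Agent", "research-scraper/0.1")]
    return opener, jar


opener, jar = make_cookie_opener()

# No request is made above. A real session might look like this
# (hypothetical URL):
#   response = opener.open("http://example.com/catalog")
#   html = response.read().decode("utf-8", errors="replace")
#   for cookie in jar:
#       print(cookie.name, cookie.value)
```

Every request made through the same `opener` shares the jar, so a session cookie set by one page is sent back on the next, which is what most shopping sites expect.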


    The other, completely different, way of web scraping is to use the
    "automation" capabilities of the various big web browsers: Microsoft
    Internet Explorer, KDE's Konqueror and Mozilla are all scriptable from
    Python. You need the Python for Windows extensions, PyKDE or PyXPCOM
    respectively to control those browsers. Advantages: easy handling of
    JavaScript and other assorted nonsense, and they're generally
    reasonably well-tested and stable pieces of software (not to mention
    de-facto standards). Disadvantages: poor portability in some cases,
    and they're rather big, complicated, closed applications that are hard
    to modify (compared to the pure Python approach) and to distribute
    (which last, I guess, isn't a problem for you, since you'll be the
    only one using your software). Other problems: COM (for MSIE) is a
    bit of a headache for newbies, PyXPCOM last time I looked seemed a
    pain to install (Brendan Eich mentioned in a newsgroup post that that
    has been changing recently, though), and PyKDE might not be that well
    tested (it's a very big wrapper!).

    One other bunch of software worthy of mention: you can use Jython to
    access various Java libraries. HTTPClient and httpunit look like they
    might be useful. In particular, the latter has some JavaScript
    support.


    John
    John J. Lee, Sep 26, 2003
    #2

  3. John J. Lee

    John J. Lee Guest

    (John J. Lee) writes:

    > (Jenny) writes:
    >
    > > I am doing research on the relationship between sales rates and
    > > discounted prices or recommendation frequency. To do this, I need to
    > > access the databases of commercial web sites via the internet. I think this

    [...]

    Forgot to say: if you don't already know, Google Groups can be worth
    its weight in round tuits. Try some searches there, in
    comp.lang.python, on the stuff I mentioned.


    John
    John J. Lee, Sep 26, 2003
    #3
  4. | IIUYC, what you're contemplating is called "web scraping"
    | ....

    John ....

    I did a bit of web scraping over the past week end
    for a friend that is interested in Lotto numbers ....

    The Lotto numbers were readily available on the web
    and presented as well-formed and readable HTML tables ....

    The primary problem I found up front was to be able
    to parse and transform this data into something
    that Python, or any other language, might be able
    to cope with for subsequent analysis ....

    Since the number of records that I was dealing with
    in this case was relatively small, only a couple of thousand,
    I could manage the initial data transformations
    using my genetically encoded EyeBall parser,
    a text editor, and a couple of one-off Python scripts ....

    The first step in each case for the source files
    was using HTML Tidy to ...

    "clean up the horrid HTML you'll find on the web "

    I'd like to emphasize for the benefit of the original poster
    that the initial data parsing will probably entail a fair amount
    of non-trivial work and that the subsequent data analysis
    and reporting will seem almost trivial by comparison ....
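
As a rough illustration of that parsing step, here is a minimal sketch that turns an HTML table into CSV text, assuming Python 3 and markup that is already reasonably clean. The table contents are invented sample data, not real Lotto results, and the class and function names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Invented sample data standing in for a tidied results page.
SAMPLE = """
<table>
  <tr><th>Date</th><th>Numbers</th></tr>
  <tr><td>2003-09-20</td><td>3 11 24 30 41</td></tr>
  <tr><td>2003-09-24</td><td>7 15 22 38 40</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect table rows as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Parse the table rows and serialize them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


csv_text = table_to_csv(SAMPLE)
```

Once the data is in CSV form, the analysis side can use whatever tools are convenient; the fiddly part is usually getting real-world pages clean enough for the parser to see the rows at all.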

    Thanks for posting the info regarding different approaches,
    as I think it will be useful for me when I get around
    to replacing my EyeBall parser with something more effective ....

    --
    Cousin Stanley
    Human Being
    Phoenix, Arizona
    Cousin Stanley, Sep 26, 2003
    #4
  5. In article <>, John J. Lee <> wrote:
    > (Jenny) writes:
    >
    >> I am doing research on the relationship between sales rates and
    >> discounted prices or recommendation frequency. To do this, I need to
    >> access the databases of commercial web sites via the internet. I think
    >> this is possible because it is similar to the work of price comparison
    >> sites and web robots.

    >
    >IIUYC, what you're contemplating is called "web scraping" -- at least,
    >it is by Cameron Laird, and I like the name. Others might know it as
    >"web client programming". Cameron wrote an article about this a while
    >back (Unix Review?) which you might like if you're a newbie -- Google
    >for it (but note that the Perl book he mentions has actually been
    >replaced by a newer one by Sean Burke, also from O'Reilly).
    >
    >
    >> I am studying Python these days because I think it is a good language
    >> for the work.

    >[...]
    >
    >I think so too.

    .
    [excellent and detailed
    technical advice]
    .
    .
    Also filling a niche in this territory is PyCurl <URL: http://pycurl.sf.net >.
    The references at <URL: http://wiki.tcl.tk/WebScraping > are likely to be at
    least inspirational.

    I'm ... reserved about the prospects for the proposed research. The commercial
    sites you want to study are, in my experience, some of the most difficult to
    "scrape". Complementing that difficulty is the poverty of inference I anticipate
    you'll be able to ground on what you find there; their commerce has a lot
    more noise than signal, as I see it. 'Twould be great, though, for you to
    uncover something real. Good luck.
    --

    Cameron Laird <>
    Business: http://www.Phaseit.net
    Personal: http://phaseit.net/claird/home.html
    Cameron Laird, Sep 27, 2003
    #5
  6. John J. Lee

    John J. Lee Guest

    (Cameron Laird) writes:

    > In article <>, John J. Lee <> wrote:
    > > (Jenny) writes:

    [...]
    > I'm ... reserved about the prospects for the proposed research. The commercial
    > sites you want to study are, in my experience, some of the most difficult to
    > "scrape".


    Which (ATM, anyway) is a good reason for doing it with browser automation.


    > Complementing that difficulty is the poverty of inference I anticipate
    > you'll be able to ground on what you find there; their commerce has a lot
    > more noise than signal, as I see it.


    What do you mean 'their commerce has more noise than signal'?


    > 'Twould be great, though, for you to
    > uncover something real. Good luck.


    What I was wondering was where the sales data are going to come from.


    John
    John J. Lee, Sep 27, 2003
    #6
  7. In article <>, John J. Lee <> wrote:
    .
    .
    .
    >> Complementing that difficulty is the poverty of inference I anticipate
    >> you'll be able to ground on what you find there; their commerce has a lot
    >> more noise than signal, as I see it.

    >
    >What do you mean 'their commerce has more noise than signal'?
    >
    >
    >> 'Twould be great, though, for you to
    >> uncover something real. Good luck.

    >
    >What I was wondering was where the sales data are going to come from.

    .
    .
    .
    That's a typical part. As I understand Jenny, she's going
    to look at, say, eBay, and correlate "sales" with "price"
    and "marketing" variables. I apologize for being obscure
    in abbreviating my judgment that that approach is likely to
    yield "more noise than signal"; you're quite right for asking
    what I mean. What I mean by that is that all the
    variables strike me as poorly replicable, in at least three
    respects:
    A. eBay and other operators have an interest
       in releasing data only as they support
       their own success, and not for their
       analytic clarity. Their incentives to
       categorize and aggregate variables can do
       no more than to leave the underlying
       relations unbiased, and that's plenty
       unlikely.
    B. I suspect the universes are so small as
       to provide little inferential power. I'm
       most tentative about this one. I know
       eBay is big business, but I suspect that
       looking at any other operation will yield
       only data from an exceptional period,
       because the businesses are *not* sustainable.
    C. Measurements of "marketing effort" and
       "promotion intensity" and other such
       qualitative notions ... well, it sounds
       ambitious to me.
    --

    Cameron Laird <>
    Business: http://www.Phaseit.net
    Personal: http://phaseit.net/claird/home.html
    Cameron Laird, Sep 28, 2003
    #7
  8. John J. Lee

    John J. Lee Guest

    (Cameron Laird) writes:
    [...John wrote:]
    > >What I was wondering was where the sales data are going to come from.

    > .
    > .
    > .
    > That's a typical part. As I understand Jenny, she's going
    > to look at, say, eBay, and correlate "sales" with "price"
    > and "marketing" variables.



    Oh, ebay, I see. I was thinking about non-auction sites. On auction
    sites, some of the sales data are public, I suppose.


    John
    John J. Lee, Sep 28, 2003
    #8
