help!! extra tricky web page to extract data from...

seberino · Mar 13, 2007

How do I extract the visible numerical data from this Microsoft financial
web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow right
in a Python script?

Any help greatly appreciated.

Sincerely,

Chris

Diez B. Roggisch · Mar 13, 2007

How do I extract the visible numerical data from this Microsoft financial
web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow right
in a Python script?

Any help greatly appreciated.

It's an AJAX site. You have to carefully analyze it and see what
actually happens in the JavaScript, then use that. Maybe something like
the HTTP header plugin for Firefox helps you there.

Diez

Max Erickson · Mar 13, 2007

seberino said:
How do I extract the visible numerical data from this Microsoft
financial web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow
right in a Python script?

Any help greatly appreciated.

Sincerely,

Chris

The URL for the data is in an iframe. If you need to scrape the
original page for some reason (instead of requesting the iframe URL
directly), you can use urlparse.urljoin to resolve the relative URL.
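
A minimal sketch of that approach in Python 3, where urlparse.urljoin now lives at urllib.parse.urljoin. The page URL and the markup are made up for illustration and are not taken from the actual site; IframeFinder and iframe_urls are hypothetical helper names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_url, html):
    """Return absolute URLs for all iframes found in html."""
    finder = IframeFinder()
    finder.feed(html)
    # urljoin resolves a relative src against the URL of the outer page
    # and leaves an already absolute src unchanged.
    return [urljoin(page_url, src) for src in finder.sources]


# Hypothetical page URL and markup, for illustration only.
PAGE_URL = "http://example.com/investing/report.aspx?symbol=XYZ"
SAMPLE_HTML = (
    '<html><body>'
    '<iframe src="frames/data.aspx?symbol=XYZ"></iframe>'
    '</body></html>'
)

if __name__ == "__main__":
    for url in iframe_urls(PAGE_URL, SAMPLE_HTML):
        print(url)
```

In a real script the HTML would come from fetching the outer page, and each resolved URL would then be fetched in turn to get the page that actually holds the data.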

max

Diez B. Roggisch · Mar 13, 2007

It's an AJAX site. You have to carefully analyze it and see what
actually happens in the JavaScript, then use that. Maybe something like
the HTTP header plugin for Firefox helps you there.

Oops, obviously I didn't look closely enough at the site. Sorry for the confusion.

Still, some pages are AJAX; you won't be able to scrape them easily
without analyzing the JS code.

Diez

Paul Rubin · Mar 13, 2007

Diez B. Roggisch said:
Still, some pages are AJAX; you won't be able to scrape them easily
without analyzing the JS code.

Sooner or later it would be great to have a JS interpreter written in
Python for this purpose. It would do all the same operations on an
HTML/XML DOM that a browser does, basically all the stuff of a browser
except rendering into pixels. JS semantics are similar enough to
Python that maybe the JS could be compiled into Python byte code.

Diez B. Roggisch · Mar 13, 2007

Paul said:
Sooner or later it would be great to have a JS interpreter written in
Python for this purpose. It would do all the same operations on an
HTML/XML DOM that a browser does, basically all the stuff of a browser
except rendering into pixels. JS semantics are similar enough to
Python that maybe the JS could be compiled into Python byte code.

Nice idea, but not really helpful in the end. Besides the rather nasty
parts of the DOMs that make JS programming the PITA it is, I think the
whole event-based stuff makes this basically impossible.

Diez

Paul Rubin · Mar 13, 2007

Diez B. Roggisch said:
Nice idea, but not really helpful in the end. Besides the rather nasty
parts of the DOMs that make JS programming the PITA it is, I think the
whole event-based stuff makes this basically impossible.

Obviously the Python interface would need ways to send events into the
DOM, simulating timer ticks, mouse clicks, and so forth, just like
urllib in a sense simulates a user navigating a browser.

Diez B. Roggisch · Mar 13, 2007

Paul said:
Obviously the Python interface would need ways to send events into the
DOM, simulating timer ticks, mouse clicks, and so forth, just like
urllib in a sense simulates a user navigating a browser.

Obviously this wouldn't really help, as you can't predict which events a
website actually wants, or in which order. Especially if the site does
not _want_ to be scrapable: think of a simple "click on the images in the
order of the numbers shown on them" captcha.

Most of the time it's easier to sniff the HTTP stream and grab the data directly.

Diez

Paul Rubin · Mar 13, 2007

Diez B. Roggisch said:
Obviously this wouldn't really help, as you can't predict which events a
website actually wants, or in which order. Especially if the site does
not _want_ to be scrapable: think of a simple "click on the images in the
order of the numbers shown on them" captcha.

Sure, but most sites don't go to such lengths, and even captchas can
be defeated if you're trying to scrape a specific site and are willing
to spend effort on the particular captcha generator that it uses.
Plus there is always www.captchasolver.com (!).

Most of the time it's easier to sniff the HTTP stream and grab the data directly.

Certainly true, but there are times when you have to pull stuff out of
the JS. It's usually possible to do that without actually
interpreting the JS, but an interpreter would make it a lot more
convenient some of the time.
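
For the common case where the numbers sit in a JavaScript variable as a JSON-compatible literal, a regular expression plus the json module is often enough. The sketch below assumes exactly that situation. The variable name chartData, the script text, and the figures in it are invented for illustration, and extract_js_literal is a hypothetical helper. Literals that are not valid JSON (single quotes, unquoted keys, trailing commas) would need more work.

```python
import json
import re

# Hypothetical script text, as it might appear inside a <script> tag.
# The figures are dummy values.
SAMPLE_JS = """
var pageTitle = "Income Statement";
var chartData = {"years": [2004, 2005, 2006], "revenue": [100, 120, 150]};
initChart(chartData);
"""


def extract_js_literal(script, name):
    """Return the JSON-compatible literal assigned in `var name = ...`."""
    match = re.search(r"\bvar\s+" + re.escape(name) + r"\s*=\s*", script)
    if match is None:
        raise KeyError(name)
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so there is no need to locate the end
    # of the literal by hand.
    value, _end = json.JSONDecoder().raw_decode(script, match.end())
    return value


if __name__ == "__main__":
    data = extract_js_literal(SAMPLE_JS, "chartData")
    for year, revenue in zip(data["years"], data["revenue"]):
        print(year, revenue)
```

This only reads data that is written out literally in the script. Anything the page computes at run time or fetches later is out of reach without either an interpreter or the HTTP-sniffing approach.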

John Nagle · Mar 13, 2007

How do I extract the visible numerical data from this Microsoft financial
web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow right
in a Python script?

Any help greatly appreciated.

Been there, done that, years ago. Try this:

http://www.downside.com/cgi/testfin...es/edgar/data/886158/0001104659-06-034196.txt

That will get you the data you're looking for.
If you want to try other companies, start at the query box on
"http://www.downside.com".

The data is actually coming from the United States Securities and Exchange
Commission's EDGAR web site, where companies are required to file their
financial statements. The filings are intended to be read by humans, but
it's possible to parse many filings mechanically. They're supposed to be
in HTML 3.2, but this isn't enforced.
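
To give a flavor of what parsing a filing mechanically can look like, here is a rough sketch that pulls numeric cells out of an HTML table using only the Python standard library. It is not the parser behind downside.com, and the sample markup is invented rather than copied from a real filing. It assumes every td and tr tag is explicitly closed, which real filings often do not guarantee. CellCollector, parse_amount and numeric_rows are hypothetical names.

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Gather the <td> text of each table row as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_amount(text):
    """Convert '$ 1,234' or '(567)' to a float; return None if not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    match = re.fullmatch(r"\((\d+(?:\.\d+)?)\)|(-?\d+(?:\.\d+)?)", cleaned)
    if match is None:
        return None
    if match.group(1) is not None:
        # Accounting convention: parentheses mark a negative amount.
        return -float(match.group(1))
    return float(match.group(2))


def numeric_rows(html):
    """Return (label, [numbers]) for each row that has numeric cells."""
    collector = CellCollector()
    collector.feed(html)
    result = []
    for row in collector.rows:
        if not row:
            continue  # e.g. a header row made only of <th> cells
        values = [parse_amount(cell) for cell in row[1:]]
        values = [v for v in values if v is not None]
        if values:
            result.append((row[0], values))
    return result


# Invented markup with dummy figures, for illustration only.
SAMPLE_HTML = """<table>
<tr><th>Item</th><th>2006</th><th>2005</th></tr>
<tr><td>Revenue</td><td>$ 1,500</td><td>$ 1,200</td></tr>
<tr><td>Net loss</td><td>(25)</td><td>(40.5)</td></tr>
</table>"""

if __name__ == "__main__":
    for label, values in numeric_rows(SAMPLE_HTML):
        print(label, values)
```

Real filings vary a great deal in layout, so a usable parser needs many more special cases than this.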

There are many EDGAR parsers, some better than ours. To do a really good one,
you have to license a patent from Price Waterhouse. Try
"http://www.10kwizard.com/", which has an API for retrieving this info.
It's not free.

John Nagle

Steve Holden · Mar 14, 2007

Paul said:
Sure, but most sites don't go to such lengths, and even captchas can
be defeated if you're trying to scrape a specific site and are willing
to spend effort on the particular captcha generator that it uses.
Plus there is always www.captchasolver.com (!).

I especially like the terms and conditions they ask you to acknowledge if
you want to sign up as a worker:

http://www.captchasolver.com/join/worker#

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Paul Rubin · Mar 14, 2007

Steve Holden said:
I especially like the terms and conditions they ask you to acknowledge
if you want to sign up as a worker:
http://www.captchasolver.com/join/worker#

Heh, cute, I guess you have to solve a different type of puzzle to
read them.

I'm surprised anyone is purporting to pay actual money for captcha
solutions. The usual scheme I've heard of (dunno if anyone actually does
it) is to feed the captchas you want to solve into a porn site, so
people give you solutions in order to keep viewing porn. You then
funnel the solutions back to the forms you're actually trying to
automate.

I think captchas are proving reasonably effective as a speed bump but
they do get defeated all the time, whether through automatic means or
otherwise.
