help!! *extra* tricky web page to extract data from...

seberino

How can I extract the visible numerical data from this Microsoft financial
web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow right
in a Python script?

Any help greatly appreciated.

Sincerely,

Chris
 
Diez B. Roggisch

seberino said:
How can I extract the visible numerical data from this Microsoft financial
web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow right
in a Python script?

Any help greatly appreciated.

It's an AJAX site. You have to carefully analyze it and see what
actually happens in the JavaScript, then use that. Maybe something like
the HTTP header plugin for Firefox helps you there.

Diez
 
Max Erickson

seberino said:
How can I extract the visible numerical data from this Microsoft
financial web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow
right in a Python script?

Any help greatly appreciated.

Sincerely,

Chris

The URL for the data is in an iframe. If you need to scrape the
original page for some reason (instead of using the iframe URL directly),
you can use urlparse.urljoin to resolve the relative URL.
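
A minimal sketch of that, assuming Python 3, where urlparse.urljoin lives at urllib.parse.urljoin. The page URL and the HTML snippet are made-up placeholders, not the real site's markup, and the IframeFinder class and iframe_urls helper are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_url, html):
    """Return absolute URLs for all iframes found in html."""
    finder = IframeFinder()
    finder.feed(html)
    finder.close()
    # urljoin resolves a relative src against the URL of the outer page.
    return [urljoin(page_url, src) for src in finder.sources]


# Hypothetical page URL and markup, for illustration only.
PAGE_URL = "http://example.com/investor/reports/index.html"
HTML = '<html><body><iframe src="../data/income.htm"></iframe></body></html>'

print(iframe_urls(PAGE_URL, HTML))
```

In a real script you would download the outer page first (for example with urllib.request.urlopen), pass its text to iframe_urls, and then fetch the resulting URL to get the page that actually holds the numbers.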


max
 
Diez B. Roggisch

Diez B. Roggisch said:
It's an AJAX site. You have to carefully analyze it and see what
actually happens in the JavaScript, then use that. Maybe something like
the HTTP header plugin for Firefox helps you there.


Oops, obviously I wasn't looking closely enough at the site. Sorry for the confusion.

Still, some pages are AJAX; you won't be able to scrape them easily
without analyzing the JS code.

Diez
 
Paul Rubin

Diez B. Roggisch said:
Still, some pages are AJAX; you won't be able to scrape them easily
without analyzing the JS code.

Sooner or later it would be great to have a JS interpreter written in
Python for this purpose. It would do all the same operations on an
HTML/XML DOM that a browser does, basically all the stuff of a browser
except rendering into pixels. JS semantics are similar enough to
Python that maybe the JS could be compiled into Python byte code.
 
Diez B. Roggisch

Paul said:
Sooner or later it would be great to have a JS interpreter written in
Python for this purpose. It would do all the same operations on an
HTML/XML DOM that a browser does, basically all the stuff of a browser
except rendering into pixels. JS semantics are similar enough to
Python that maybe the JS could be compiled into Python byte code.

Nice idea, but not really helpful in the end. Besides the rather nasty
parts of the DOMs that make JS programming the PITA it is, I think the
whole event-based stuff makes this basically impossible.

Diez
 
Paul Rubin

Diez B. Roggisch said:
Nice idea, but not really helpful in the end. Besides the rather nasty
parts of the DOMs that make JS programming the PITA it is, I think the
whole event-based stuff makes this basically impossible.

Obviously the Python interface would need ways to send events into the
DOM, simulating timer ticks, mouse clicks, and so forth, just like
urllib in a sense simulates a user navigating a browser.
 
Diez B. Roggisch

Paul said:
Obviously the Python interface would need ways to send events into the
DOM, simulating timer ticks, mouse clicks, and so forth, just like
urllib in a sense simulates a user navigating a browser.

Obviously this wouldn't really help, as you can't predict which events a
website actually wants, or in which order. Especially if the site does not
_want_ to be scrapable: think of a simple "click on the images in the order
of the numbers shown on them" captcha.

Most of the time it's easier to sniff the HTTP stream and grab the data directly.

Diez
 
Paul Rubin

Diez B. Roggisch said:
Obviously this wouldn't really help, as you can't predict which events a
website actually wants, or in which order. Especially if the site does
not _want_ to be scrapable: think of a simple "click on the images in
the order of the numbers shown on them" captcha.

Sure, but most sites don't go to such lengths, and even captchas can
be defeated if you're trying to scrape a specific site and are willing
to spend effort on the particular captcha generator that it uses.
Plus there is always www.captchasolver.com (!).

Diez B. Roggisch said:
Most of the time it's easier to sniff the HTTP stream and grab the data directly.

Certainly true, but there are times when you have to pull stuff out of
the JS. It's usually possible to do that without actually
interpreting the JS, but an interpreter would make it a lot more
convenient some of the time.
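
One common way to do that without an interpreter, sketched here under assumptions: if the page's script assigns the data to a variable as a JSON-compatible literal, a regular expression can locate the assignment and the json module can parse the value. The script text, the variable name "revenue", the figures, and the extract_js_object helper are all invented for illustration.

```python
import json
import re


def extract_js_object(script_text, var_name):
    """Return the value assigned to var_name if it is a JSON-compatible literal.

    Returns None when no "var <name> =" assignment is found. Raises
    ValueError when the assigned value is not valid JSON (for example,
    unquoted keys or single-quoted strings).
    """
    pattern = re.compile(r"\bvar\s+" + re.escape(var_name) + r"\s*=\s*")
    match = pattern.search(script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value from the start of the string and
    # ignores whatever follows it (the semicolon and the rest of the script).
    value, _end = json.JSONDecoder().raw_decode(script_text[match.end():])
    return value


# Made-up inline script with dummy numbers, for illustration only.
SCRIPT = 'var title = "Q4"; var revenue = {"q1": 100, "q2": 125}; render(revenue);'

print(extract_js_object(SCRIPT, "revenue"))
```

This only covers the easy case. When the value is built up by code rather than written as a literal, you are back to reading the JS by hand, which is where an interpreter would help.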
 
John Nagle

seberino said:
How can I extract the visible numerical data from this Microsoft financial
web site?

http://tinyurl.com/yw2w4h

If you simply download the HTML file you'll see the data is *not*
embedded in it but loaded from some other file.

Surely if I can see the data in my browser I can grab it somehow right
in a Python script?

Any help greatly appreciated.

Been there, done that, years ago. Try this:

http://www.downside.com/cgi/testfin...es/edgar/data/886158/0001104659-06-034196.txt

That will get you the data you're looking for.
If you want to try other companies, start at the query box on
"http://www.downside.com".

The data is actually coming from the United States Securities and Exchange
Commission's EDGAR web site, where companies are required to file their
financial statements. The filings are intended to be read by humans, but
it's possible to parse many filings mechanically. They're supposed to be
in HTML 3.2, but this isn't enforced.
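
Because the markup isn't reliably valid, a tolerant parser is the safer starting point. The sketch below is not the downside.com parser; it only shows one way to pull table rows out of sloppy HTML with Python's standard html.parser, coping with missing </td> and </tr> tags. The class and function names, the sample table, and its figures are invented for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()       # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()      # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()


def extract_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_number(text):
    """Convert '1,000' or '(250)' to a float; parentheses mean negative."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value


# Made-up, deliberately sloppy table with dummy figures.
SAMPLE = """
<TABLE>
<TR><TD>Revenue<TD ALIGN=right>1,000
<TR><TD>Net income</TD><TD>(250)</TD></TR>
</TABLE>
"""

for label, amount in extract_rows(SAMPLE):
    print(label, to_number(amount))
```

Real filings need much more than this (nested tables, footnote markers, units stated in a header row), which is part of why the dedicated parsers exist.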

There are many EDGAR parsers, some better than ours. To do a really good one,
you have to license a patent from Price Waterhouse. Try
"http://www.10kwizard.com/", which has an API for retrieving this info.
It's not free.

John Nagle
 
Steve Holden

Paul said:
Sure, but most sites don't go to such lengths, and even captchas can
be defeated if you're trying to scrape a specific site and are willing
to spend effort on the particular captcha generator that it uses.
Plus there is always www.captchasolver.com (!).

I especially like the terms and conditions they ask you to acknowledge if
you want to sign up as a worker:

http://www.captchasolver.com/join/worker#

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007
 
Paul Rubin

Steve Holden said:
I especially like the terms and conditions they ask you to acknowledge
if you want to sign up as a worker:
http://www.captchasolver.com/join/worker#

Heh, cute, I guess you have to solve a different type of puzzle to
read them.

I'm surprised anyone is purporting to pay actual money for captcha
solutions. The usual scheme I've heard of (dunno if anyone actually does
it) is to feed the captchas you want to solve into a porn site, so
people give you solutions in order to keep viewing porn. You then
funnel the solutions back to the forms you're actually trying to
automate.

I think captchas are proving reasonably effective as a speed bump but
they do get defeated all the time, whether through automatic means or
otherwise.
 
