JavaScript and Screenscraping

Roedy Green

I am working on a screenscraping project that is turning out to be much
more time-consuming than I thought it would be. I am trying to gather
a database of information about all the motherboards sold by major
manufacturers. The idea is to eventually create a comparison shopper
to help you narrow down models that fit your needs.

Oddly, motherboard manufacturers don't use a database to generate
their specification pages. These are all hand-compiled, with a theme and
a dozen variations on every field. This I can handle.
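
One way to cope with the "dozen variations on every field" is to squash each scraped label to a bare key and look it up in an alias table. The following is only a minimal sketch: the class name `SpecLabelNormalizer` and every label variant in the table are hypothetical, and the canonical names are borrowed from the column names in the schema posted later in this thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Maps hand-typed spellings of a spec label onto one canonical field name. */
final class SpecLabelNormalizer {
    // Hypothetical alias table: squashed label variant -> canonical field name.
    private static final Map<String, String> ALIASES = new HashMap<>();

    static {
        alias("formFactor", "form factor", "board size");
        alias("socket", "cpu socket", "socket type", "processor socket");
        alias("memoryType", "memory type", "ram type", "memory standard");
    }

    private static void alias(String canonical, String... variants) {
        for (String variant : variants) {
            ALIASES.put(squash(variant), canonical);
        }
    }

    /** Lower-cases the label and drops everything except ASCII letters and digits. */
    static String squash(String label) {
        return label.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Returns the canonical field for a scraped label, or empty if it is unknown. */
    static Optional<String> canonicalField(String rawLabel) {
        return Optional.ofNullable(ALIASES.get(squash(rawLabel)));
    }
}
```

Labels that come back empty can be logged and added to the table by hand, which keeps the guesswork out of the scraper itself.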

However, Asus decided to obfuscate their web pages with JavaScript.
There are no data on them.

I wondered if there exists a tool that is like a browser in that it will
read a page and render the JavaScript, but unlike a browser, it would
not show the information on the screen, just dump the generated HTML
or raw text and accept a script of pages to analyse.
 
Michal Kleczek

Roedy said:
I wondered if there exists a tool that is like a browser in that it will
read a page and render the JavaScript, but unlike a browser, it would
not show the information on the screen, just dump the generated HTML
or raw text and accept a script of pages to analyse.

http://htmlunit.sourceforge.net/
 
Tom Anderson


Finally, someone else who knows about it!

tom

--
For the first few years I ate lunch with the mathematicians. I soon found
that they were more interested in fun and games than in serious work,
so I shifted to eating with the physics table. There I stayed for a
number of years until the Nobel Prize, promotions, and offers from
other companies, removed most of the interesting people. So I shifted
to the corresponding chemistry table where I had a friend. At first I
asked what were the important problems in chemistry, then what important
problems they were working on, or problems that might lead to important
results. One day I asked, "if what they were working on was not important,
and was not likely to lead to important things, then why were they working
on them?" After that I had to eat with the engineers! -- R. W. Hamming
 
Roedy Green


What I am doing is similar.

I want to track price information from multiple sources, and track all
MBs I can find, not just ones sold by one particular vendor. It is a
comparison shopper, though it could be used by a retailer. I have
different categories, leaning more toward those you would use to
eliminate some motherboards from consideration, rather than categorise
branding info. I wrote MB companies asking for computer-friendly
sources of info. They have not been forthcoming.
Perhaps they will if the thing catches on.

It is just amazing how many goofy things vendors do that
interfere with scraping.

Here is my current database schema:

/* Presume database mother already exists; create it with dbcreate if
   necessary. */

DROP TABLE IF EXISTS mboards;
DROP TABLE IF EXISTS sellers;
DROP TABLE IF EXISTS prices;

CREATE TABLE mboards (
    /* no cache, no slots */
    manufacturer       NUMERIC(2) NOT NULL, /* enum */
    model              VARCHAR(30) NOT NULL,
    manufacturerPartNo VARCHAR(30),
    revision           VARCHAR(8),
    formFactor         NUMERIC(2), /* enum */
    widthInCm          NUMERIC(3, 1),
    heightInCm         NUMERIC(3, 1),
    socket             NUMERIC(2), /* enum */
    video              VARCHAR(40),
    memoryType         NUMERIC(2), /* enum */
    maxGig             NUMERIC(3),
    ramSpeedMhz        NUMERIC(4),
    usb2               NUMERIC(2),
    usb2Internal       NUMERIC(2),
    usb2Rear           NUMERIC(2),
    usb3               NUMERIC(2),
    usb3Internal       NUMERIC(2),
    usb3Rear           NUMERIC(2),
    sata2              NUMERIC(2),
    sata3              NUMERIC(2),
    watts              NUMERIC(4),
    theatreSound       BOOLEAN,
    lastUpdated        NUMERIC(7)
);

CREATE TABLE prices (
    seller       NUMERIC(2) NOT NULL, /* enum */
    manufacturer NUMERIC(2) NOT NULL,
    model        VARCHAR(30) NOT NULL,
    sellerPartNo VARCHAR(50),
    currency     CHAR(3),
    price        NUMERIC(6, 2),
    lastUpdated  NUMERIC(7)
);
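
The columns marked `/* enum */` hold small numeric codes. On the Java side, one possible approach (a sketch, not necessarily what the poster does) is an enum that carries an explicit code, so that reordering or inserting members never changes what is already stored. The member list and the code values below are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative enum persisted in a NUMERIC(2) column through an explicit, stable code. */
enum Manufacturer {
    // Members and codes are assumptions for this sketch, not taken from the thread.
    ASUS(1), GIGABYTE(2), MSI(3);

    private static final Map<Integer, Manufacturer> BY_CODE = new HashMap<>();

    static {
        for (Manufacturer m : values()) {
            BY_CODE.put(m.code, m);
        }
    }

    private final int code;

    Manufacturer(int code) {
        // NUMERIC(2) can hold at most two digits.
        if (code < 0 || code > 99) {
            throw new IllegalArgumentException("code does not fit NUMERIC(2): " + code);
        }
        this.code = code;
    }

    /** Value to write into the manufacturer column. */
    int code() {
        return code;
    }

    /** Converts a value read from the manufacturer column back to the enum. */
    static Manufacturer fromCode(int code) {
        Manufacturer m = BY_CODE.get(code);
        if (m == null) {
            throw new IllegalArgumentException("unknown manufacturer code: " + code);
        }
        return m;
    }
}
```

Using `ordinal()` instead would be shorter, but the stored numbers would then shift whenever a member is added in the middle of the list.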
 
Dr J R Stockton

In comp.lang.java.programmer message
<rvc6p6toumdlevjb48ohjnlf1gur128eqe@4ax.com>, Wed, 30 Mar 2011 06:51:29,
Roedy Green <see_website@mindprod.com.invalid> posted:
I wondered if there exists a tool that is like a browser in that it will
read a page and render the JavaScript, but unlike a browser, it would
not show the information on the screen, just dump the generated HTML
or raw text and accept a script of pages to analyse.

A JavaScript newsgroup might know.

But JavaScript used as you describe does not necessarily generate HTML,
but can manipulate the DOM tree directly.

Or are you thinking of server-side scripting with .php?
 
Roedy Green

But JavaScript used as you describe does not necessarily generate HTML,
but can manipulate the DOM tree directly.

Or are you thinking of server-side scripting with .php?

I am just trying to go to motherboard manufacturer websites and
collect specs from the webpages. The webpages often contain a lot of
JavaScript. The data does not appear in the page source in any form.
Presumably the JavaScript loads more JavaScript or resources, then
formats them.
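
A quick way to test that presumption is to list the external scripts a page pulls in and then look inside them for the data or for further URLs. The sketch below uses a deliberately crude regular expression over the raw HTML; the class name `ScriptSourceLister` is hypothetical, and a real HTML parser would be more robust against unusual markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Lists the src attributes of script tags found in raw, unrendered HTML. */
final class ScriptSourceLister {
    // Matches <script ... src="..."> or src='...', ignoring case.
    // Inline scripts without a src attribute are skipped.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    static List<String> scriptSources(String rawHtml) {
        List<String> sources = new ArrayList<>();
        Matcher matcher = SCRIPT_SRC.matcher(rawHtml);
        while (matcher.find()) {
            sources.add(matcher.group(1));
        }
        return sources;
    }
}
```

If the specs turn out to live in one of the listed files, or in a resource that file requests, fetching that resource directly may be simpler than rendering the whole page.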
 
Dr J R Stockton

In comp.lang.java.programmer message
<t64dp61er3n5cbkpuippmpji0dlaijbsnm@4ax.com>, Fri, 1 Apr 2011 20:00:27,
Roedy Green <see_website@mindprod.com.invalid> posted:
I am just trying to go to motherboard manufacturer websites and
collect specs from the webpages. The webpages often contain a lot of
JavaScript. The data does not appear in the page source in any form.
Presumably the JavaScript loads more JavaScript or resources, then
formats them.

Probably but not entirely presumably; if using an iframe, there could be
no need for reformatting.

Given a URL or two as examples, and a clear indication of what is to be
scraped, one might be able to understand the situation better.
 
RedGrittyBrick

I wondered if there exists a tool that is like a browser in that it will
read a page and render the JavaScript, but unlike a browser, it would
not show the information on the screen, just dump the generated HTML
or raw text and accept a script of pages to analyse.

http://links.twibright.com/features.php:

"Links runs in text mode (mouse optional) on UN*X console, ssh/telnet
virtual terminal, vt100 terminal, xterm, and virtually any other text
terminal."

Links2 supports JavaScript.

I haven't used it, but it seems to have command-line options. Maybe,
as with Lynx, some of them allow you to save the HTML to a file?

Open Source, so if the GPL is usable for your project, you can probably
repurpose it.
 
