JavaScript and Screenscraping

Discussion in 'Java' started by Roedy Green, Mar 30, 2011.

  1. Roedy Green

    Roedy Green Guest

    I am working on a screenscraping project that is turning out to be much
    more time-consuming than I thought it would be. I am trying to gather
    a database of information about all the motherboards sold by major
    manufacturers. The idea is to eventually create a comparison shopper
    to help you narrow down models that fit your needs.

    Oddly, motherboard manufacturers don't use a database to generate
    their specification pages. These are all hand-compiled, with a theme and
    a dozen variations on every field. This I can handle.
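
    As a rough illustration of coping with a dozen spellings of one field,
    here is a minimal Java sketch. The class name FieldNormalizer, the sample
    labels, and the canonical names are all invented for illustration; none
    of them come from any manufacturer's pages.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: maps hand-typed spec labels to one canonical field name. */
public class FieldNormalizer {
    // Assumed sample variants; a real table would be built up from the scraped pages.
    private static final Map<String, String> CANONICAL = Map.of(
        "formfactor", "formFactor",
        "boardsize", "formFactor",
        "memory", "memoryType",
        "memorytype", "memoryType",
        "cpusocket", "socket",
        "socket", "socket");

    /** Lower-cases the label and strips everything except ASCII letters and digits. */
    public static String key(String rawLabel) {
        return rawLabel.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Returns the canonical field name, or empty if the label is not yet known. */
    public static Optional<String> normalize(String rawLabel) {
        return Optional.ofNullable(CANONICAL.get(key(rawLabel)));
    }

    public static void main(String[] args) {
        System.out.println(normalize("Form Factor:"));
        System.out.println(normalize("CPU-Socket"));
        System.out.println(normalize("Chipset")); // not in the table, so empty
    }
}
```

    Returning an empty Optional for unknown labels, rather than guessing,
    makes it easy to log each new variation and add it to the table by hand.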

    However, Asus decided to obfuscate their web pages with JavaScript.
    There are no data on them.

    I wondered if there exists a tool that is like a browser in that it will
    read a page and render the JavaScript, but unlike a browser, it would
    not show the information on the screen, just dump the generated HTML
    or raw text and accept a script of pages to analyse.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    There are only two industries that refer to their customers as "users".
    ~ Edward Tufte
    Roedy Green, Mar 30, 2011
    #1

  2. Roedy Green wrote:

    > I am working on a screenscraping project that is turning out to be much
    > more time-consuming than I thought it would be. I am trying to gather
    > a database of information about all the motherboards sold by major
    > manufacturers. The idea is to eventually create a comparison shopper
    > to help you narrow down models that fit your needs.
    >
    > Oddly, motherboard manufacturers don't use a database to generate
    > their specification pages. These are all hand-compiled, with a theme and
    > a dozen variations on every field. This I can handle.
    >
    > However, Asus decided to obfuscate their web pages with JavaScript.
    > There are no data on them.
    >
    > I wondered if there exists a tool that is like a browser in that it will
    > read a page and render the JavaScript, but unlike a browser, it would
    > not show the information on the screen, just dump the generated HTML
    > or raw text and accept a script of pages to analyse.
    >


    http://htmlunit.sourceforge.net/

    --
    Michal
    Michal Kleczek, Mar 30, 2011
    #2

  3. Tom Anderson

    Tom Anderson Guest

    On Wed, 30 Mar 2011, Michal Kleczek wrote:

    > Roedy Green wrote:
    >
    >> I am working on a screenscraping project that is turning out to be much
    >> more time-consuming than I thought it would be. I am trying to gather
    >> a database of information about all the motherboards sold by major
    >> manufacturers. The idea is to eventually create a comparison shopper
    >> to help you narrow down models that fit your needs.
    >>
    >> Oddly, motherboard manufacturers don't use a database to generate
    >> their specification pages. These are all hand-compiled, with a theme and
    >> a dozen variations on every field. This I can handle.
    >>
    >> However, Asus decided to obfuscate their web pages with JavaScript.
    >> There are no data on them.
    >>
    >> I wondered if there exists a tool that is like a browser in that it will
    >> read a page and render the JavaScript, but unlike a browser, it would
    >> not show the information on the screen, just dump the generated HTML
    >> or raw text and accept a script of pages to analyse.

    >
    > http://htmlunit.sourceforge.net/


    Finally, someone else who knows about it!

    tom

    --
    For the first few years I ate lunch with the mathematicians. I soon found
    that they were more interested in fun and games than in serious work,
    so I shifted to eating with the physics table. There I stayed for a
    number of years until the Nobel Prize, promotions, and offers from
    other companies, removed most of the interesting people. So I shifted
    to the corresponding chemistry table where I had a friend. At first I
    asked what were the important problems in chemistry, then what important
    problems they were working on, or problems that might lead to important
    results. One day I asked, "if what they were working on was not important,
    and was not likely to lead to important things, then why were they working
    on them?" After that I had to eat with the engineers! -- R. W. Hamming
    Tom Anderson, Mar 31, 2011
    #3
  4. Roedy Green

    Roedy Green Guest

    On Wed, 30 Mar 2011 07:40:32 -0700, Peter Duniho wrote, quoted or
    indirectly quoted someone who said:

    >Already done. For example:
    >http://www.newegg.com/Store/SubCategory.aspx?SubCategory=2


    What I am doing is similar.

    I want to track price information from multiple sources, and track all
    MBs I can find, not just ones sold by one particular vendor. It is a
    comparison shopper, though it could be used by a retailer. I have
    different categories, leaning more toward those you would use to
    eliminate some motherboards from consideration, rather than categorise
    branding info. I wrote to the MB companies asking for computer-friendly
    sources of info. They have not been forthcoming.
    Perhaps they will if the thing catches on.

    It is just amazing how many goofy things vendors do that interfere
    with scraping.

    Here is my current database schema:

    /* Presumes the database "mother" already exists; create it with dbcreate
    if necessary. */

    DROP TABLE IF EXISTS mboards;
    DROP TABLE IF EXISTS sellers;
    DROP TABLE IF EXISTS prices;

    CREATE TABLE mboards (
        /* no cache, no slots */
        manufacturer       NUMERIC( 2 ) NOT NULL, /* enum */
        model              VARCHAR( 30 ) NOT NULL,
        manufacturerPartNo VARCHAR( 30 ),
        revision           VARCHAR( 8 ),
        formFactor         NUMERIC( 2 ), /* enum */
        widthInCm          NUMERIC( 3, 1 ),
        heightInCm         NUMERIC( 3, 1 ),
        socket             NUMERIC( 2 ), /* enum */
        video              VARCHAR( 40 ),
        memoryType         NUMERIC( 2 ), /* enum */
        maxGig             NUMERIC( 3 ),
        ramSpeedMhz        NUMERIC( 4 ),
        usb2               NUMERIC( 2 ),
        usb2Internal       NUMERIC( 2 ),
        usb2Rear           NUMERIC( 2 ),
        usb3               NUMERIC( 2 ),
        usb3Internal       NUMERIC( 2 ),
        usb3Rear           NUMERIC( 2 ),
        sata2              NUMERIC( 2 ),
        sata3              NUMERIC( 2 ),
        watts              NUMERIC( 4 ),
        theatreSound       BOOLEAN,
        lastUpdated        NUMERIC( 7 ) );

    CREATE TABLE prices (
        seller             NUMERIC( 2 ) NOT NULL, /* enum */
        manufacturer       NUMERIC( 2 ) NOT NULL, /* enum */
        model              VARCHAR( 30 ) NOT NULL,
        sellerPartNo       VARCHAR( 50 ),
        currency           CHAR( 3 ),
        price              NUMERIC( 6, 2 ),
        lastUpdated        NUMERIC( 7 ) );
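
    The columns marked /* enum */ store small numeric codes. One possible way
    to back such a column from Java is an enum that carries an explicit code,
    sketched below. The enum name FormFactor, its constants, and the code
    values are assumptions made for this example; the post does not say which
    values or codes are actually used.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical enum backing a NUMERIC(2) "enum" column such as formFactor. */
public enum FormFactor {
    // Codes are invented for this sketch; they must stay stable once rows are stored.
    ATX(1), MICRO_ATX(2), MINI_ITX(3), EXTENDED_ATX(4);

    private final int code;

    FormFactor(int code) {
        this.code = code;
    }

    /** Value to store in the numeric column. */
    public int code() {
        return code;
    }

    // Reverse index from stored code to constant, built once when the enum loads.
    private static final Map<Integer, FormFactor> BY_CODE =
        Arrays.stream(values()).collect(Collectors.toMap(FormFactor::code, Function.identity()));

    /** Reverse lookup when reading a row back; rejects codes this enum does not define. */
    public static FormFactor fromCode(int code) {
        FormFactor f = BY_CODE.get(code);
        if (f == null) {
            throw new IllegalArgumentException("unknown formFactor code " + code);
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(MICRO_ATX.code());
        System.out.println(fromCode(3));
    }
}
```

    Explicit codes are used here instead of ordinal() so that reordering or
    inserting constants later does not silently change what stored rows mean.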


    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    There are only two industries that refer to their customers as "users".
    ~ Edward Tufte
    Roedy Green, Mar 31, 2011
    #4
  5. In comp.lang.java.programmer message
    <rvc6p6toumdlevjb48ohjnlf1gur128eqe@4ax.com>, Wed, 30 Mar 2011 06:51:29,
    Roedy Green <see_website@mindprod.com.invalid> posted:

    >I wondered if there exists a tool that is like a browser in that it will
    >read a page and render the JavaScript, but unlike a browser, it would
    >not show the information on the screen, just dump the generated HTML
    >or raw text and accept a script of pages to analyse.


    A JavaScript newsgroup might know.

    But JavaScript used as you describe does not necessarily generate HTML,
    but can manipulate the DOM tree directly.

    Or are you thinking of server-side scripting with .php?

    --
    (c) John Stockton, nr London UK. ?@merlyn.demon.co.uk IE8 FF3 Op10 Sf5 Cr7
    news:comp.lang.javascript FAQ <http://www.jibbering.com/faq/index.html>.
    <http://www.merlyn.demon.co.uk/js-index.htm> jscr maths, dates, sources.
    <http://www.merlyn.demon.co.uk/> TP/BP/Delphi/jscr/&c, FAQ items, links.
    Dr J R Stockton, Apr 1, 2011
    #5
  6. Roedy Green

    Roedy Green Guest

    On Fri, 1 Apr 2011 23:39:32 +0100, Dr J R Stockton wrote, quoted or
    indirectly quoted someone who said:

    >But JavaScript used as you describe does not necessarily generate HTML,
    >but can manipulate the DOM tree directly.
    >
    >Or are you thinking of server-side scripting with .php?


    I am just trying to go to motherboard manufacturer websites and
    collect specs from the webpages. The webpages often contain a lot of
    JavaScript. The data does not appear in any form. Presumably the
    JavaScript loads more JavaScript or resources, then formats it.
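
    One way to check that guess is to list the script URLs referenced by the
    raw HTML and then look inside those files for the data. A minimal Java
    sketch follows; the class name ScriptSources, the sample HTML, and the
    URLs in it are invented, and a regular expression is only a rough
    stand-in for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: lists the external scripts a raw HTML page refers to. */
public class ScriptSources {
    // Rough match for <script ... src="..."> or src='...'; it will also accept
    // look-alike attributes such as data-src, so treat the result as a hint.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
        "<script\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Returns the src values of script tags, in the order they appear. */
    public static List<String> scriptSources(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = SCRIPT_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up page source standing in for a downloaded spec page.
        String html = "<html><head><script src=\"/js/specs.js\"></script>"
            + "<SCRIPT type='text/javascript' src='http://example.com/data.js'></SCRIPT>"
            + "<script>var inline = 1;</script></head></html>";
        System.out.println(scriptSources(html));
    }
}
```

    Inline scripts have no src attribute and are skipped, so they still need
    to be read by eye.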
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Doing what the user expects with respect to navigation is absurdly important for user satisfaction.
    ~ anonymous Google Android developer
    Roedy Green, Apr 2, 2011
    #6
  7. In comp.lang.java.programmer message
    <t64dp61er3n5cbkpuippmpji0dlaijbsnm@4ax.com>, Fri, 1 Apr 2011 20:00:27,
    Roedy Green <
    om.invalid> posted:

    >On Fri, 1 Apr 2011 23:39:32 +0100, Dr J R Stockton wrote, quoted or
    >indirectly quoted someone who said:
    >
    >>But JavaScript used as you describe does not necessarily generate HTML,
    >>but can manipulate the DOM tree directly.
    >>
    >>Or are you thinking of server-side scripting with .php?

    >
    >I am just trying to go to motherboard manufacturer websites and
    >collect specs from the webpages. The webpages often contain a lot of
    >JavaScript. The data does not appear in any form. Presumably the
    >JavaScript loads more JavaScript or resources, then formats it.


    Probably but not entirely presumably; if using an iframe, there could be
    no need for reformatting.

    Given a URL or two as examples, and a clear indication of what is to be
    scraped, one might be able to understand the situation better.

    --
    (c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05.
    Website <http://www.merlyn.demon.co.uk/> - w. FAQish topics, links, acronyms
    PAS EXE etc. : <http://www.merlyn.demon.co.uk/programs/> - see in 00index.htm
    Dates - miscdate.htm estrdate.htm js-dates.htm pas-time.htm critdate.htm etc.
    Dr J R Stockton, Apr 3, 2011
    #7
  8. On 30/03/2011 14:51, Roedy Green wrote:
    > I wondered if there exists a tool that is like a browser in that it will
    > read a page and render the JavaScript, but unlike a browser, it would
    > not show the information on the screen, just dump the generated HTML
    > or raw text and accept a script of pages to analyse.
    >


    http://links.twibright.com/features.php:

    "Links runs in text mode (mouse optional) on UN*X console, ssh/telnet
    virtual terminal, vt100 terminal, xterm, and virtually any other text
    terminal. "

    Links2 supports Javascript.

    I haven't used it, but it seems to have command-line options. Maybe, as
    with Lynx, some of them allow you to save the HTML to a file?

    Open Source, so if the GPL is usable for your project, you can probably
    repurpose it.

    --
    RGB
    RedGrittyBrick, Apr 5, 2011
    #8
