Mechanize for BIG website scraping...

Discussion in 'Ruby' started by Horacio Sanson, Sep 21, 2006.

  1. I am using Mechanize for several projects that require me to download
    large numbers of HTML pages from a web site. Since I am working with
    about 1000 pages, the limitations of Mechanize started to appear...

    Try this code

    ################################################
    require 'rubygems'
    require 'mechanize'

    agent = WWW::Mechanize.new

    prev = 0
    curr = 0
    prev_pages = 0
    curr_pages = 0

    1000.times do
      page = agent.get("http://yourfavoritepage.com")
      curr = 0
      curr_pages = 0
      # Count the total number of objects and the number of
      # WWW::Mechanize::Page objects.
      ObjectSpace.each_object { |o|
        curr += 1
        curr_pages += 1 if o.class == WWW::Mechanize::Page
      }
      puts "There are #{curr} (#{curr - prev}) objects"
      puts "There are #{curr_pages} (#{curr_pages - prev_pages}) page objects"
      prev = curr
      prev_pages = curr_pages
      GC.enable
      GC.start
      sleep 1.0 # This avoids the script taking 100% CPU
    end

    ############################################

    The output of this script reveals that at each iteration a
    WWW::Mechanize::Page object gets created (along with a lot of other
    objects) and that these never get garbage collected. So you can watch
    your RAM fly away at each iteration and never come back.

    Now this can be solved by moving agent = WWW::Mechanize.new inside the
    block, like this:

    ############################################

    1000.times do
      agent = WWW::Mechanize.new   # <-- CHANGE IS HERE
      page = agent.get("http://yourfavoritepage.com")
      curr = 0
      curr_pages = 0
      # Count the total number of objects and the number of WWW::Mechanize::Page

    ..... the rest is the same
    #############################################


    With this change we see that the number of WWW::Mechanize::Page objects
    never rises above three, and the other objects increase and decrease by
    about 60 per iteration.


    Does this mean that the WWW::Mechanize object keeps references to all the
    pages it has downloaded, and that those pages will not be garbage
    collected as long as the WWW::Mechanize object is alive?

    In my script I cannot throw away the WWW::Mechanize object, since this
    particular page is a form and requires cookie state information to access
    the pages I need to download. Is there a way to tell the Mechanize object
    to delete the pages it has already downloaded?

    regards,
    Horacio
     
    Horacio Sanson, Sep 21, 2006
    #1

  2. What if you save the cookies out to a file?
    WWW::Mechanize::CookieJar has #save_as and #load methods to save and
    restore cookies.
    I actually ran into a similar issue recently; your diagnosis explains
    why my program used too much memory.

    You might try the following (assuming "browser" is your
    WWW::Mechanize object):

    browser.page.content.replace "" # that's an empty string
    browser.page.root.children = []

    That should clear both the original text and the parsed HTML. I'm not
    sure whether this would get rid of all the references, but it should at
    least help.
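
    For the cookie idea, something roughly like this might work (untested;
    the cookie file name and the loop body are just placeholders for your
    setup, and the initial form login that creates the session is not shown):

    ############################################
    require 'rubygems'
    require 'mechanize'

    COOKIE_FILE = 'cookies.yml'

    1000.times do
      # A fresh agent each time, so old pages can be garbage collected.
      agent = WWW::Mechanize.new
      agent.cookie_jar.load(COOKIE_FILE) if File.exist?(COOKIE_FILE)

      page = agent.get("http://yourfavoritepage.com")
      # ... process the page here ...

      # Save the session cookies so the next agent can pick them up.
      agent.cookie_jar.save_as(COOKIE_FILE)
    end
    ############################################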

    --John
     
    John Labovitz, Sep 21, 2006
    #2

  3. Thanks for your answer, but I found a way to fix this problem.

    A look at the Mechanize code reveals that each page loaded is stored in
    the history kept inside the Mechanize object. This means that as long as
    the Mechanize object exists, the pages will never go away.

    Solution? Simply set the history_max value to something more sensible
    than infinite.

    ############################
    agent = WWW::Mechanize.new
    agent.history_max = 10
    ############################

    and that's it... no more memory hungry Mechanize.

    I noticed that setting this value to zero causes some problems when
    submitting forms, so don't set it to zero. Even one seems to work OK.
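
    For reference, putting this together with the loop from the first post
    gives something like the sketch below (same placeholder URL as before):

    ############################
    require 'rubygems'
    require 'mechanize'

    agent = WWW::Mechanize.new
    agent.history_max = 10   # keep at most the last 10 pages, as above

    1000.times do
      page = agent.get("http://yourfavoritepage.com")
      # ... process the page here ...
    end
    ############################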

    Hope this helps,

    Horacio

    On Thursday, September 21, 2006 at 14:03, John Labovitz wrote:
     
    Horacio Sanson, Sep 21, 2006
    #3
