How to save a webpage's contents to a file (with LWP)

Discussion in 'Perl Misc' started by Jack, Feb 20, 2008.

  1. Jack

    Jack Guest

    Hi there, does anyone skilled in the art of LWP (or other perl module)
    and screen scraping know how to do the equivalent of a "file", "save
    as" html content? Some webpages aren't scrapeable, but when you save
    their content down to a local file, it's available. Any ideas would be
    great.

    Also, if there is a drop-down plus a button to select content, BUT the
    HTML source has no "submit" entry at all, how does one remote-control a
    user selection without this POST handle?

    Thanks in advance,

    Jack
    Jack, Feb 20, 2008
    #1

  2. Jack <> wrote in news::

    > Hi there, does anyone skilled in the art of LWP (or other perl module)
    > and screen scraping know how to do the equivalent of a "file", "save
    > as" html content ?


    http://search.cpan.org/~gaas/libwww-perl-5.808/lib/LWP/Simple.pm

    getstore($url, $file)

    http://search.cpan.org/~gaas/libwww-perl-5.808/lib/LWP.pm#The_Response_Object

    http://search.cpan.org/~gaas/libwww-perl-5.808/lib/HTTP/Response.pm

    $r->content( $content )

    This is used to get/set the raw content

    $r->decoded_content( %options )

    This will return the content after any Content-Encoding and charsets
    have been decoded.
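For the simplest case, `getstore($url, $file)` from LWP::Simple writes the response body straight to a file and returns the HTTP status code. When more control is wanted, something like the following sketch may help. It assumes a "save as" should keep the server's bytes but undo transfer compression, so it calls `decoded_content` with `charset => 'none'`. The helper names `write_response` and `save_page` are made up for illustration and are not part of LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# Write the body of an HTTP::Response to $file, byte for byte.
# decoded_content(charset => 'none') undoes any Content-Encoding
# (such as gzip) but leaves the character encoding untouched.
sub write_response {
    my ($res, $file) = @_;
    die "Request failed: " . $res->status_line . "\n"
        unless $res->is_success;

    my $body = $res->decoded_content(charset => 'none');
    die "Could not decode response body\n" unless defined $body;

    open my $fh, '>:raw', $file or die "Cannot open $file: $!\n";
    print {$fh} $body;
    close $fh or die "Cannot close $file: $!\n";
    return length $body;    # number of bytes written
}

# Fetch $url and save it to $file (hypothetical helper name).
sub save_page {
    my ($url, $file) = @_;
    my $ua = LWP::UserAgent->new(timeout => 30);
    return write_response($ua->get($url), $file);
}
```

A call such as `save_page('http://www.example.com/', 'page.html')` would then behave roughly like the browser's "save as" for the HTML itself; images, stylesheets and frames are not fetched.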

    > Also, if there is a drop down + button to select content BUT in the
    > HTML source no "submit" entry at all, how does one remote control a
    > user selection without this post handle ?


    If the page uses Javascript to dynamically post form contents, you will
    have to figure out what the Javascript does and replicate it.
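As a rough illustration of "replicating" what the page does: once the request the browser sends has been identified (for example from the form's ACTION attribute or a script), it can be rebuilt by hand. The sketch below assumes the selection ends up as an ordinary URL-encoded POST. The URL, the field name and the helper names are placeholders, not taken from the actual site.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);
use LWP::UserAgent;

# Build the POST request a browser would send for the given form fields.
# $fields is an array reference of name => value pairs, so order is kept.
sub build_selection_request {
    my ($action_url, $fields) = @_;
    return POST($action_url, $fields);
}

# Send the request and return the decoded response body.
sub submit_selection {
    my ($action_url, $fields) = @_;
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $res = $ua->request(build_selection_request($action_url, $fields));
    die "POST failed: " . $res->status_line . "\n" unless $res->is_success;
    return $res->decoded_content;
}
```

Building the request separately from sending it makes it possible to print `$req->as_string` and compare it with what the browser actually sends before touching the server.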

    Sinan


    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)
    clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>
    A. Sinan Unur, Feb 20, 2008
    #2

  3. Jack

    Jack Guest

    On Feb 20, 5:49 am, "A. Sinan Unur" <> wrote:
    > Jack <> wrote in news::
    >
    > > Hi there, does anyone skilled in the art of LWP (or other perl module)
    > > and screen scraping know how to do the equivalent of a "file", "save
    > > as" html content ?

    >
    > http://search.cpan.org/~gaas/libwww-perl-5.808/lib/LWP/Simple.pm
    >
    > getstore($url, $file)
    >
    > http://search.cpan.org/~gaas/libwww-perl-5.808/lib/LWP.pm#The_Respons...
    >
    > http://search.cpan.org/~gaas/libwww-perl-5.808/lib/HTTP/Response.pm
    >
    > $r->content( $content )
    >
    >     This is used to get/set the raw content
    >
    > $r->decoded_content( %options )
    >
    >     This will return the content after any Content-Encoding and charsets
    >     has been decoded.
    >
    > > Also, if there is a drop down + button to select content BUT in the
    > > HTML source no "submit" entry at all, how does one remote control a
    > > user selection without this post handle ?

    >
    > If the page uses Javascript to dynamically post form contents, you will
    > have to figure out what the Javascript does and replicate it.
    >
    > Sinan
    >
    > --
    > A. Sinan Unur <>
    > (remove .invalid and reverse each component for email address)
    > clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>


    Hi Sinan, the site uses ASP, no JS files. This is all there is in the
    html:
    <!--<SCRIPT>
    //
    </SCRIPT>-->
    <FRAMESET ROWS="70,*" FRAMESPACING=0>
    <FRAME NAME="header" SRC="./header_default.asp?
    NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
    MARGINHEIGHT="0">

    <FRAME NAME="bodyx" SRC=
    body.asp?centerin=GGCC
    SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">


    </FRAMESET>

    </HTML>
    Jack, Feb 20, 2008
    #3
  4. Jack <> wrote in
    news::

    > On Feb 20, 5:49 am, "A. Sinan Unur" <> wrote:
    >> Jack <> wrote
    >> in news:412be207-d043-4b9d-bd96-252942:
    >>
    >> > Hi there, does anyone skilled in the art of LWP (or other perl
    >> > module) and screen scraping know how to do the equivalent of a
    >> > "file", "save as" html content ?

    >>
    >> http://search.cpan.org/~gaas/libwww-perl-5.808/lib/LWP/Simple.pm
    >>
    >> getstore($url, $file)
    >>
    >> http://search.cpan.org/~gaas/libwww-perl-5.808/lib/LWP.pm#The_Respons...
    >>
    >> http://search.cpan.org/~gaas/libwww-perl-5.808/lib/HTTP/Response.pm
    >>
    >> $r->content( $content )
    >>
    >>     This is used to get/set the raw content
    >>
    >> $r->decoded_content( %options )
    >>
    >>     This will return the content after any Content-Encoding and
    >>     charsets has been decoded.
    >>
    >> > Also, if there is a drop down + button to select content BUT in the
    >> > HTML source no "submit" entry at all, how does one remote control a
    >> > user selection without this post handle ?

    >>
    >> If the page uses Javascript to dynamically post form contents, you
    >> will have to figure out what the Javascript does and replicate it.
    >>
    >> Sinan
    >>
    >> --
    >> A. Sinan Unur <>


    Do *not* quote sigs.

    > Hi Sinan the site uses ASP, no JS files.. this is all there is in the
    > html
    > <!--<SCRIPT>
    > //
    > </SCRIPT>-->
    > <FRAMESET ROWS="70,*" FRAMESPACING=0>
    > <FRAME NAME="header" SRC="./header_default.asp?
    > NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
    > MARGINHEIGHT="0">
    >
    > <FRAME NAME="bodyx" SRCbody.asp?centerin=GGCC


    I am assuming you retyped the source rather than copying and pasting it.
    Please don't retype code.

    > SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">


    Oh, but there is more. How about them frames?

    Anyway, this forum is for help with the Perl aspect of things. If you
    need to learn html, there is a group for that as well.

    Sinan
    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)
    clpmisc guidelines: <URL:http://www.rehabitation.com/clpmisc.shtml>
    A. Sinan Unur, Feb 20, 2008
    #4
  5. Jack wrote:
    > this is all there is in the html
    > <!--<SCRIPT>
    > //
    > </SCRIPT>-->
    > <FRAMESET ROWS="70,*" FRAMESPACING=0>
    > <FRAME NAME="header" SRC="./header_default.asp?
    > NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
    > MARGINHEIGHT="0">
    >
    > <FRAME NAME="bodyx" SRC=
    > body.asp?centerin=GGCC
    > SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">
    >
    >
    > </FRAMESET>
    >
    > </HTML>


    Then get the bodyx frame, not the frameset.
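One possible way to pick out that frame programmatically is to scan the frameset page for FRAME tags and read the SRC of the one named `bodyx`. The sketch below uses HTML::TokeParser for that. The helper name `frame_src` is hypothetical, and the sample markup in the usage note is a cleaned-up stand-in for the page quoted above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the SRC attribute of the first <FRAME> whose NAME equals $name,
# or undef if the document has no such frame.
sub frame_src {
    my ($html, $name) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser\n";

    while (my $tag = $p->get_tag('frame')) {
        my $attr = $tag->[1];    # hash ref; attribute names are lower-cased
        next unless defined $attr->{name} && $attr->{name} eq $name;
        return $attr->{src};
    }
    return;
}
```

Given `<FRAME NAME="bodyx" SRC="body.asp?centerin=GGCC">` in the input, `frame_src($html, 'bodyx')` returns the relative value `body.asp?centerin=GGCC`, which still has to be resolved against the URL of the frameset page.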

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 20, 2008
    #5
  6. Jack

    Jack Guest

    On Feb 20, 8:08 am, Gunnar Hjalmarsson <> wrote:
    > Jack wrote:
    > > this is all there is in the html
    > >   <!--<SCRIPT>
    > >    //
    > >   </SCRIPT>-->
    > >   <FRAMESET ROWS="70,*" FRAMESPACING=0>
    > >    <FRAME NAME="header" SRC="./header_default.asp?
    > > NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
    > > MARGINHEIGHT="0">

    >
    > >    <FRAME NAME="bodyx" SRC=
    > > body.asp?centerin=GGCC
    > >    SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">

    >
    > > </FRAMESET>

    >
    > > </HTML>

    >
    > Then get the bodyx frame, not the frameset.
    >
    > --
    > Gunnar Hjalmarsson
    > Email: http://www.gunnar.cc/cgi-bin/contact.pl


    How exactly does one get the bodyx frame, and more importantly, how do
    you auto-select from the select box when there is no mention of
    it or of a submit button in the HTML for this ASP application?
    Thank you,
    Jack
    Jack, Feb 20, 2008
    #6
  7. Jack wrote:
    > On Feb 20, 8:08 am, Gunnar Hjalmarsson <> wrote:
    >> Jack wrote:
    >>> this is all there is in the html
    >>> <!--<SCRIPT>
    >>> //
    >>> </SCRIPT>-->
    >>> <FRAMESET ROWS="70,*" FRAMESPACING=0>
    >>> <FRAME NAME="header" SRC="./header_default.asp?
    >>> NoCache=2%2F20%2F2008+7%3A35%3A47+AM" SCROLLING="no" MARGINWIDTH="2"
    >>> MARGINHEIGHT="0">
    >>> <FRAME NAME="bodyx" SRC=
    >>> body.asp?centerin=GGCC
    >>> SCROLLING="auto" MARGINWIDTH="2" MARGINHEIGHT="2">
    >>> </FRAMESET>
    >>> </HTML>

    >>
    >> Then get the bodyx frame, not the frameset.

    >
    > How exactly does one get the bodyx frame,


    Assuming the URL of the frameset is
    http://www.example.com/somepage/index.asp, you would probably use the URL
    http://www.example.com/somepage/body.asp?centerin=GGCC, because the
    relative SRC value is resolved against the frameset's own URL.
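The same resolution can be done in Perl with the URI module rather than by hand. This is a minimal sketch; the example.com address is the placeholder used above, and `frame_url` is a made-up helper name.

```perl
use strict;
use warnings;
use URI;

# Resolve a frame's (possibly relative) SRC value against the URL of
# the frameset page and return the absolute URL as a string.
sub frame_url {
    my ($src, $frameset_url) = @_;
    return URI->new_abs($src, $frameset_url)->as_string;
}

# Example with the placeholder frameset URL from the post:
#   frame_url('body.asp?centerin=GGCC',
#             'http://www.example.com/somepage/index.asp');
```

The resulting absolute URL can then be fetched like any other page, for instance with LWP::Simple's `getstore`.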

    > and more importantly how do
    > you auto select from the select box when there is no such mention of
    > it or a submit button in html for this ASP application.


    As Sinan mentioned, you apparently need to learn some basics about HTML.
    Asking questions in a Perl group is not the right way to do so.

    Recommended reading: http://www.w3.org/TR/html4/present/frames.html

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 21, 2008
    #7
