LWP Doesn't Seem To Save Cookies:

Discussion in 'Perl Misc' started by Hal Vaughan, Mar 23, 2005.

  1. Hal Vaughan

    Hal Vaughan Guest

    I'm trying to write a scraper for a website that uses cookies. The short of
    it is that I keep getting their "You have to set your browser to allow
    cookies" message. The code for the full scraper is a bit much, so here are
    the relevant sections:

    use File::Spec::Functions;
    use File::Basename;
    use File::Copy;
    use LWP::UserAgent;
    use HTTP::Cookies;
    use URI::WithBase;
    use DBI;
    use strict;

    Here's where I set up the variables (not all "my" and "our" statements are
    included):

    print "Cookie file: $cfile\n";
    $ua = LWP::UserAgent->new;
    $ua->timeout(5);
    $ua->agent("Netscape/7.1");
    $cjar = HTTP::Cookies->new(file => $cfile, autosave => 1,
                               ignore_discard => 1);
    $ua->cookie_jar($cjar);

    Here's where I get the login page (which I always retrieve to make sure the
    fields or info haven't changed):


    $page = $ua->get($url);
    $page = $page->as_string;

    And after that, I go through the page, make sure the form input fields
    haven't changed (which are "login" and "key" for the username and
    password). Then I post the data for the next page, including the form
    data:


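    $parm = "";
    foreach (keys %form) {
        print "\tAdding parm. Key: $_, Value: $form{$_}\n";
        # NOTE: names and values are not URL-escaped here
        $parm = "$parm$_=$form{$_}&";
    }
    $parm =~ s/&$//;    # drop the trailing "&"
    $req = HTTP::Request->new(POST => $url);
    # Set the content type once; a second content_type("form-data") call
    # would overwrite it with a value that does not describe this body.
    $req->content_type("application/x-www-form-urlencoded");
    $req->header('Accept' => 'text/html');
    $req->content($parm);
    $page = $ua->request($req);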

    When I'm building up $parm, I'm taking the values from %form. I TRIED to
    use the hash to post the values, using "$page = $ua->post($url, \%form);",
    but even though it worked on a test web server on my LAN, it wouldn't work
    on the system I'm scraping (don't know why -- if you can help here as well,
    feel free to chip in).
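
    One way to avoid building the body by hand is to let HTTP::Request::Common
    construct the POST. This is only a sketch: the URL and the field values
    below are made up for illustration (the post only names the fields "login"
    and "key"), and nothing is sent over the network.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical URL and values; only the field names come from the post.
my $url  = 'http://www.example.com/login';
my @form = (login => 'hal vaughan', key => 'p&ss');

# Given an array reference, POST() URL-escapes each name and value and
# sets Content-Type to application/x-www-form-urlencoded.
my $req = POST $url, \@form;

print $req->method, " ", $req->uri, "\n";
print "Content-Type: ", $req->content_type, "\n";
print $req->content, "\n";
```

    The resulting object can be handed to $ua->request($req). Calling
    $ua->post($url, \@form) builds the request the same way, so if that fails
    against a particular site, the cause is probably something other than the
    form encoding.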

    The problem comes up when I use the code above to post the form data and get
    the next page. The next page is a frameset with two frames. I get the
    frame URLs from the page and load them:

    $req = HTTP::Request->new(GET => $url);
    # A GET has no body, so no content type is set here.
    $res = $ua->request($req);    # keep the response separate from the request
    $page = $res->as_string;

    And this is when I always get the "You don't have cookies" message.

    I thought that LWP automatically took the cookies out of the page (I also
    thought cookies were in the header, the one here is set with
    document.cookie="doc cookie" within the document), and stored them in the
    cookie jar automatically. That doesn't seem to be happening. I've been
    reading the perldocs, but I can't see anything in the response object that
    allows me to check the page for cookies, so I can do it myself.

    So why aren't the cookies being kept, and why can't the pages I retrieve
    AFTER the cookie is set see it? Is part of the problem that they are in
    frames?

    Any help on this is appreciated.

    Thanks!

    Hal
    Hal Vaughan, Mar 23, 2005
    #1

  2. Hal Vaughan

    Todd W Guest

    "Hal Vaughan" <> wrote in message
    news:...
    > I'm trying to write a scraper for a website that uses cookies. The short
    > of it is that I keep getting their "You have to set your browser to allow
    > cookies" message. The code for the full scraper is a bit much, so here
    > are the relevant sections:
    >

    <snip />

    I've had a lot of success using LWP to scrape web pages; for instance, I
    have a neat program that shows me all my bank account balances on my
    web-enabled cell phone. But I've also had some trouble getting LWP to
    scrape pages that require cookies.

    Here's my code:

    [trwww[at]waveright temp]$ perl -MWWW::Mechanize::Shell -e 'shell'
    >get https://www.setsivr.odjfs.state.oh.us/welcome.asp

    Retrieving https://www.setsivr.odjfs.state.oh.us/welcome.asp(200)
    https://www.setsivr.odjfs.state.oh.us/cookieerror.htm>

    If the client and the server were doing everything according to
    specification, this would work.

    I get the same problem with lynx, and another poster on perl.libwww verified
    my issue and also got the same error using a Python HTTP library.

    Here's the archive of my thread:

    http://groups-beta.google.com/group/perl.libwww/browse_thread/thread/38d09ffd6ff2f4fd

    I guess that since it doesn't work with lynx I can say that the server is
    doing something that isn't standard, but it sucks because it works fine in
    any of the major graphical browsers I've tried.

    I suppose that someone who knew HTTP well enough could say why it doesn't
    work, but I know it pretty well and I can't figure it out, and I've tried
    pretty hard.

    Todd W
    Todd W, Mar 23, 2005
    #2

  3. Hal Vaughan

    Hal Vaughan Guest

    Todd W wrote:

    > "Hal Vaughan" <> wrote in message
    > news:...
    >> I'm trying to write a scraper for a website that uses cookies. The short
    >> of it is that I keep getting their "You have to set your browser to allow
    >> cookies" message. The code for the full scraper is a bit much, so here
    >> are the relevant sections:
    >>
    > <snip />
    >
    > I've had a lot of success using LWP to scrape web pages; for instance, I
    > have a neat program that shows me all my bank account balances on my
    > web-enabled cell phone. But I've also had some trouble getting LWP to
    > scrape pages that require cookies.
    >
    > Here's my code:
    >
    > [trwww[at]waveright temp]$ perl -MWWW::Mechanize::Shell -e 'shell'
    >>get https://www.setsivr.odjfs.state.oh.us/welcome.asp
    >
    > Retrieving https://www.setsivr.odjfs.state.oh.us/welcome.asp(200)
    > https://www.setsivr.odjfs.state.oh.us/cookieerror.htm>
    >
    > If the client and the server were doing everything according to
    > specification, this would work.
    >
    > I get the same problem with lynx, and another poster on perl.libwww
    > verified my issue and also got the same error using a Python HTTP
    > library.
    >
    > Here's the archive of my thread:
    >
    > http://groups-beta.google.com/group/perl.libwww/browse_thread/thread/38d09ffd6ff2f4fd

    I checked the thread, and I've gone back over the pages I downloaded. I
    wasn't clear (I think I mentioned it in my first post) about how cookies
    are normally handled, and had not looked closely at the files (since I
    figured that was not likely the problem). It turns out that the cookie IS
    being set in JavaScript, which I suspected, but I didn't realize this was
    a problem. I wrote a routine that scans the page, grabs the cookie, and
    sets it manually with $cookie_jar->set_cookie(), and it looks like it is
    set properly (it includes the domain and path settings as well). However,
    even after setting the cookie manually, I either get "no cookie" messages,
    or trying to load any page after the login gives me the login page again
    (which I noticed happens in Firefox if I try to paste in a link to a page
    after the login page when I'm not logged in). (I also looked at the
    cookies in Firefox to see if it looked like the same ones I was getting in
    Perl, and they seem the same except for the session ID number.)
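
    For reference, a routine of the kind described above might look roughly
    like the sketch below. The page body, cookie name, value, and host are all
    invented for illustration, since the real site's script is not shown in
    this thread. The domain passed to set_cookie() has to match the host the
    later requests go to, otherwise the jar will not offer the cookie.

```perl
use strict;
use warnings;
use HTTP::Cookies;

# Hypothetical page body; the real script and cookie name are not in the thread.
my $body = <<'HTML';
<script>document.cookie="session=abc123; path=/";</script>
HTML

my $jar = HTTP::Cookies->new;

if ($body =~ /document\.cookie\s*=\s*"([^"]+)"/) {
    # The first "name=value" pair is the cookie; the rest are attributes.
    my ($pair, @attrs) = split /\s*;\s*/, $1;
    my ($name, $value) = split /=/, $pair, 2;

    # Arguments: version, key, val, path, domain, port,
    #            path_spec, secure, maxage, discard
    $jar->set_cookie(0, $name, $value, '/', 'www.example.com',
                     undef, 1, 0, undef, 1);
}

# Dump the jar to confirm what was stored.
print $jar->as_string;
```

    A regex like this only handles a literal string assignment. If the page
    builds the cookie value in script, the value has to be worked out some
    other way.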

    So I've found a way to set the cookie by hand, but the server I'm trying to
    read from doesn't seem to see the cookie is set. Is there something I need
    to do, other than setting a cookie, to make sure the server I'm connecting
    to knows the cookie is set?

    This is not an area I'm an expert in, and it's frustrating because I need to
    get this done. I'm low on sleep and trying to put together a lot more
    pieces than I expected. I didn't know that when I send a page request to a
    server, the server can actually read the cookie with the request. I thought
    cookies were only used by client-side JavaScript, but the fact that the
    server won't send me the right pages without the cookie seems to say the
    server can read it. Is that right? If so, how do I make sure the server
    gets the cookie?

    Thanks for any help on this!

    Hal
    Hal Vaughan, Mar 23, 2005
    #3
  4. Hal Vaughan wrote:
    > I thought that LWP automatically took the cookies out of the page (I also
    > thought cookies were in the header, the one here is set with
    > document.cookie="doc cookie" within the document), and stored them in the
    > cookie jar automatically. That doesn't seem to be happening. I've been
    > reading the perldocs, but I can't see anything in the response object that
    > allows me to check the page for cookies, so I can do it myself.


    This thread with a similar topic might contain something useful:

    http://groups-beta.google.com/group/comp.lang.perl.misc/browse_frm/thread/f8f4b9ef0d73a11d

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Mar 23, 2005
    #4
  5. Hal Vaughan

    Hal Vaughan Guest

    Gunnar Hjalmarsson wrote:

    > Hal Vaughan wrote:
    >> I thought that LWP automatically took the cookies out of the page (I also
    >> thought cookies were in the header, the one here is set with
    >> document.cookie="doc cookie" within the document), and stored them in the
    >> cookie jar automatically. That doesn't seem to be happening. I've been
    >> reading the perldocs, but I can't see anything in the response object
    >> that allows me to check the page for cookies, so I can do it myself.

    >
    > This thread with a similar topic might contain something useful:
    >
    > http://groups-beta.google.com/group/comp.lang.perl.misc/browse_frm/thread/f8f4b9ef0d73a11d


    Thanks. I read through it. I already have the ignore_discard set, so that
    isn't it.

    At this point, I think it's a bigger problem and I could use some
    clarification from anyone (I'm trying to find info on Google, but am not
    doing too well). It turns out the cookie is set by JavaScript, with
    "document.cookie=". Since LWP doesn't run JavaScript, it never sees this
    cookie, so I'm pulling it out with a regex and setting it manually. That
    doesn't seem to help though, so I've got some more questions:

    1) If I have an HTTP::Response object, and I pull out the JavaScript cookie
    string, is there a way to add it to the header in the Response object and
    re-parse the Response to get the cookie into the jar? Or would that be no
    different from setting the cookie manually?

    2) How does the server know what my cookies are? I had no idea that the
    server was able to read cookies, but since I get different pages without
    the cookie than what I should get, I think the server has a way of
    detecting the cookies on my system.

    3) If I'm right, and the server can read my cookies (other than reading them
    with client-side JavaScript, which was what I used to think happened), is
    it worth sending the cookie as POST data instead?
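
    On question 2: the server never looks at the client's cookie store. The
    client sends matching cookies back in a Cookie request header, and that
    header is all the server sees. The sketch below shows the step in
    isolation, using an invented cookie and host; LWP::UserAgent performs it
    for every request once a cookie_jar is attached.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;

# Hypothetical cookie and host, for illustration only.
$jar->set_cookie(0, 'session', 'abc123', '/', 'www.example.com',
                 undef, 1, 0, undef, 1);

my $req = HTTP::Request->new(GET => 'http://www.example.com/frame1.html');

# Copy every cookie whose domain and path match the request URL
# into the request's Cookie header.
$jar->add_cookie_header($req);

# Print the request as it would be sent, headers included.
print $req->as_string;
```

    If the printed request has no Cookie header, the jar decided the cookie
    did not apply, usually because the stored domain or path does not match
    the URL being requested.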

    If anyone can help me with these, it'll be a huge help.

    Thanks!

    Hal
    Hal Vaughan, Mar 23, 2005
    #5
  6. Hal Vaughan wrote:
    > Gunnar Hjalmarsson wrote:
    >> This thread with a similar topic might contain something useful:
    >>
    >> http://groups-beta.google.com/group/comp.lang.perl.misc/browse_frm/thread/f8f4b9ef0d73a11d

    >
    > Thanks. I read through it. I already have the ignore_discard set, so that
    > isn't it.


    I knew that you have ignore_discard set; my thought was that other
    details in Richard's code might serve as clues.

    I have no experience of my own with HTTP::Cookies, but when helping
    Richard, I noticed that the module provides quite a few methods, some
    of which appear to be relevant to you.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Mar 23, 2005
    #6
  7. Hal Vaughan <> wrote on 2005-03-23:
    [snip]
    > even after setting the cookie manually, I either get "no cookie" messages,
    > or trying to load any page after the login gives me the login page again
    > (which I noticed happens in Firefox if I try to paste in a link to a page
    > after the login page when I'm not logged in).


    It looks like the server might be checking the Referer header. You
    may want to try to include one in every request you make, like this:

    my $res = $ua->get($url, Referer => $ref);

    where $ref is the URL of the page you got $url from. (It might be
    enough just to give any URL from the same site, but then again, it
    might not.)

    A server paranoid enough to do things like that may also be checking
    User-Agent headers, so if you're not doing that already, I'd suggest
    setting yours to imitate some common browser, like this:

    $ua->agent('Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)');

    --
    Ilmari Karonen
    To reply by e-mail, please replace ".invalid" with ".net" in address.
    Ilmari Karonen, Mar 28, 2005
    #7
  8. Hal Vaughan

    Joe Smith Guest

    Hal Vaughan wrote:

    > At this point, I think it's a bigger problem and I could use some
    > clarification from anyone


    Last time I had a problem like this, I told my browser to use an
    http proxy, and had the proxy log what was actually being sent to
    the server. I used http://www.inwap.com/mybin/miscunix/?tcp-proxy
    to do the logging when my proxy did not log everything I needed.
    -Joe

    P.S. I noticed that cookies are mentioned in
    http://search.cpan.org/~petdance/WWW-Mechanize-1.12/lib/WWW/Mechanize.pm
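
    The logging idea can also be approximated inside LWP itself, without a
    proxy. The sketch below assumes a libwww-perl release that provides
    add_handler(), which is newer than this thread, and uses invented URLs.
    The handler returns a canned response so the example runs offline.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

my $ua = LWP::UserAgent->new;
my @sent;    # text of every request LWP is about to send

$ua->add_handler(request_send => sub {
    my ($request, $agent, $handler) = @_;
    push @sent, $request->as_string;

    # Returning a response here skips the network entirely.
    # For real logging, end the handler with a bare "return;" instead.
    return HTTP::Response->new(200, 'OK',
        ['Content-Type' => 'text/plain'], "canned\n");
});

# Hypothetical URLs for illustration.
my $res = $ua->get('http://www.example.com/frame1.html',
                   Referer => 'http://www.example.com/frameset.html');

print $_ for @sent;
print $res->status_line, "\n";
```

    Comparing this output with what a browser sends for the same page (for
    example via a logging proxy) shows which headers or cookies are missing
    from the script's requests.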
    Joe Smith, Apr 5, 2005
    #8
