Mechanize get method returns both ::File and ::Page

Discussion in 'Ruby' started by Carl Bernardi, Feb 3, 2008.

  1. Hi,

    I am having some problems with WWW::Mechanize. When I use the get(url)
    method, it unpredictably returns either a WWW::Mechanize::File or a
    WWW::Mechanize::Page. Since I am downloading an HTML page, I always
    need it to return a Page, not a File. The content type for this page
    is "text/plain", which I think is part of the problem.

    I am looking for a way to guarantee that the method returns a Page, a
    way to get a Page from a File, or a way to cast a File to a Page.

    Thanks,

    Carl


    http://www.gaihosa.com
    --
    Posted via http://www.ruby-forum.com/.
     
    Carl Bernardi, Feb 3, 2008
    #1

  2. Carl Bernardi wrote:
    > Hi,
    >
    > I am having some problems with WWW::Mechanize. When I use the get(url)
    > method, it unpredictably returns either a WWW::Mechanize::File or a
    > WWW::Mechanize::Page.


    > Since I am downloading an HTML page, I always
    > need it to return a Page, not a File. The content type for this page
    > is "text/plain", which I think is part of the problem.
    >


    >> Class WWW::Mechanize::Page
    >>
    >> Synopsis
    >>
    >> This class encapsulates an HTML page. If Mechanize finds a content
    >> type of 'text/html', this class will be instantiated and returned.
    >>


    Presumably that means that if the content type is not text/html, a
    Page will not be returned. That makes sense, since the synopsis says
    that a Page encapsulates an HTML page.

    >> WWW::Mechanize::File
    >>
    >> If Mechanize cannot find an appropriate class to use for the content
    >> type, this class will be used. For example, if you download a JPG,
    >> Mechanize will not know how to parse it, so this class will be
    >> instantiated.
    >>


    Since Mechanize is used to parse HTML and forms, that makes sense: if
    you don't have an HTML page (i.e., one with Content-Type text/html),
    then you can't parse it as HTML.



    > The content type for this page is "text/plain", which I think is
    > part of the problem.


    A page with a content type of 'text/plain' is telling you that the page
    is not HTML. Are you saying that the page is actually HTML even though
    the server says that it is not?
     
    7stud --, Feb 3, 2008
    #2


  3. >> The content type for this page is "text/plain", which I think is
    >> part of the problem.

    >
    > A page with a content type of 'text/plain' is telling you that the page
    > is not html. Are you saying that the page is actually html even though
    > the page says that it does not contain html?


    The page is HTML. Below, I have included the log. It shows the page's
    content type to be "text/html" for the first few attempts and
    "text/plain" for the last attempt. All I need to know is how to get a
    Page instead of a File, whether by extending Mechanize, creating an
    instance of WWW::Mechanize::Page with the body from the File object, or
    some other method, since I need to get the links.

    Any ideas?

    # Logfile created on Sun Feb 03 18:20:36 -0500 2008 by logger.rb/1.5.2.9
    I, [2008-02-03T18:20:36.381042 #15528] INFO -- : Net::HTTP::Get:
    /menus.htm
    D, [2008-02-03T18:20:36.478723 #15528] DEBUG -- : request-header:
    accept-language => en-us,en;q=0.5
    D, [2008-02-03T18:20:36.478919 #15528] DEBUG -- : request-header:
    connection => keep-alive
    D, [2008-02-03T18:20:36.479000 #15528] DEBUG -- : request-header: accept
    => */*
    D, [2008-02-03T18:20:36.479073 #15528] DEBUG -- : request-header:
    accept-encoding => gzip,identity
    D, [2008-02-03T18:20:36.479147 #15528] DEBUG -- : request-header:
    user-agent => Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en)
    AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
    D, [2008-02-03T18:20:36.479221 #15528] DEBUG -- : request-header:
    accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
    D, [2008-02-03T18:20:36.479295 #15528] DEBUG -- : request-header:
    keep-alive => 300
    D, [2008-02-03T18:20:36.511382 #15528] DEBUG -- : Read 605 bytes
    D, [2008-02-03T18:20:36.516205 #15528] DEBUG -- : Read 1141 bytes
    D, [2008-02-03T18:20:36.516409 #15528] DEBUG -- : response-header:
    last-modified => Sat, 17 Feb 2007 23:40:30 GMT
    D, [2008-02-03T18:20:36.516486 #15528] DEBUG -- : response-header:
    connection => Keep-Alive
    D, [2008-02-03T18:20:36.516559 #15528] DEBUG -- : response-header:
    content-type => text/html
    D, [2008-02-03T18:20:36.516631 #15528] DEBUG -- : response-header: etag
    => "4688-475-9e16f780", "4688-475-9e16f780"
    D, [2008-02-03T18:20:36.516702 #15528] DEBUG -- : response-header: date
    => Sun, 03 Feb 2008 23:24:11 GMT
    D, [2008-02-03T18:20:36.516773 #15528] DEBUG -- : response-header:
    server => Apache-AdvancedExtranetServer
    D, [2008-02-03T18:20:36.516845 #15528] DEBUG -- : response-header:
    content-length => 1141
    D, [2008-02-03T18:20:36.516918 #15528] DEBUG -- : response-header:
    keep-alive => timeout=15, max=100
    D, [2008-02-03T18:20:36.516990 #15528] DEBUG -- : response-header:
    accept-ranges => bytes, bytes
    I, [2008-02-03T18:20:36.517359 #15528] INFO -- : status: 200
    I, [2008-02-03T18:21:40.578768 #15591] INFO -- : Net::HTTP::Get:
    /menus.htm
    D, [2008-02-03T18:21:40.704310 #15591] DEBUG -- : request-header:
    accept-language => en-us,en;q=0.5
    D, [2008-02-03T18:21:40.704504 #15591] DEBUG -- : request-header:
    connection => keep-alive
    D, [2008-02-03T18:21:40.704582 #15591] DEBUG -- : request-header: accept
    => */*
    D, [2008-02-03T18:21:40.704657 #15591] DEBUG -- : request-header:
    accept-encoding => gzip,identity
    D, [2008-02-03T18:21:40.704732 #15591] DEBUG -- : request-header:
    user-agent => Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en)
    AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
    D, [2008-02-03T18:21:40.704806 #15591] DEBUG -- : request-header:
    accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
    D, [2008-02-03T18:21:40.704879 #15591] DEBUG -- : request-header:
    keep-alive => 300
    D, [2008-02-03T18:21:40.740010 #15591] DEBUG -- : Read 681 bytes
    D, [2008-02-03T18:21:40.740522 #15591] DEBUG -- : Read 1141 bytes
    D, [2008-02-03T18:21:40.740674 #15591] DEBUG -- : response-header:
    last-modified => Sat, 17 Feb 2007 23:40:30 GMT
    D, [2008-02-03T18:21:40.740755 #15591] DEBUG -- : response-header:
    connection => Keep-Alive
    D, [2008-02-03T18:21:40.740829 #15591] DEBUG -- : response-header:
    content-type => text/html
    D, [2008-02-03T18:21:40.740904 #15591] DEBUG -- : response-header: etag
    => "4688-475-9e16f780", "4688-475-9e16f780"
    D, [2008-02-03T18:21:40.740978 #15591] DEBUG -- : response-header: date
    => Sun, 03 Feb 2008 23:25:15 GMT
    D, [2008-02-03T18:21:40.741053 #15591] DEBUG -- : response-header:
    server => Apache-AdvancedExtranetServer
    D, [2008-02-03T18:21:40.741127 #15591] DEBUG -- : response-header:
    content-length => 1141
    D, [2008-02-03T18:21:40.741200 #15591] DEBUG -- : response-header:
    keep-alive => timeout=15, max=100
    D, [2008-02-03T18:21:40.741273 #15591] DEBUG -- : response-header:
    accept-ranges => bytes, bytes
    I, [2008-02-03T18:21:40.741640 #15591] INFO -- : status: 200
    I, [2008-02-03T18:21:44.596803 #15596] INFO -- : Net::HTTP::Get:
    /menus.htm
    D, [2008-02-03T18:21:44.664035 #15596] DEBUG -- : request-header:
    accept-language => en-us,en;q=0.5
    D, [2008-02-03T18:21:44.664264 #15596] DEBUG -- : request-header:
    connection => keep-alive
    D, [2008-02-03T18:21:44.664345 #15596] DEBUG -- : request-header: accept
    => */*
    D, [2008-02-03T18:21:44.664417 #15596] DEBUG -- : request-header:
    accept-encoding => gzip,identity
    D, [2008-02-03T18:21:44.664488 #15596] DEBUG -- : request-header:
    user-agent => Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en)
    AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
    D, [2008-02-03T18:21:44.664559 #15596] DEBUG -- : request-header:
    accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
    D, [2008-02-03T18:21:44.664630 #15596] DEBUG -- : request-header:
    keep-alive => 300
    D, [2008-02-03T18:21:44.698991 #15596] DEBUG -- : Read 605 bytes
    D, [2008-02-03T18:21:44.701238 #15596] DEBUG -- : Read 1141 bytes
    D, [2008-02-03T18:21:44.701421 #15596] DEBUG -- : response-header:
    last-modified => Sat, 17 Feb 2007 23:40:30 GMT
    D, [2008-02-03T18:21:44.701496 #15596] DEBUG -- : response-header:
    connection => Keep-Alive
    D, [2008-02-03T18:21:44.701566 #15596] DEBUG -- : response-header:
    content-type => text/html
    D, [2008-02-03T18:21:44.701638 #15596] DEBUG -- : response-header: etag
    => "4688-475-9e16f780", "4688-475-9e16f780"
    D, [2008-02-03T18:21:44.701708 #15596] DEBUG -- : response-header: date
    => Sun, 03 Feb 2008 23:25:19 GMT
    D, [2008-02-03T18:21:44.701779 #15596] DEBUG -- : response-header:
    server => Apache-AdvancedExtranetServer
    D, [2008-02-03T18:21:44.701848 #15596] DEBUG -- : response-header:
    content-length => 1141
    D, [2008-02-03T18:21:44.701919 #15596] DEBUG -- : response-header:
    keep-alive => timeout=15, max=100
    D, [2008-02-03T18:21:44.702133 #15596] DEBUG -- : response-header:
    accept-ranges => bytes, bytes
    I, [2008-02-03T18:21:44.702519 #15596] INFO -- : status: 200
    I, [2008-02-03T18:21:46.272708 #15602] INFO -- : Net::HTTP::Get:
    /menus.htm
    D, [2008-02-03T18:21:46.332880 #15602] DEBUG -- : request-header:
    accept-language => en-us,en;q=0.5
    D, [2008-02-03T18:21:46.333074 #15602] DEBUG -- : request-header:
    connection => keep-alive
    D, [2008-02-03T18:21:46.333147 #15602] DEBUG -- : request-header: accept
    => */*
    D, [2008-02-03T18:21:46.333218 #15602] DEBUG -- : request-header:
    accept-encoding => gzip,identity
    D, [2008-02-03T18:21:46.333288 #15602] DEBUG -- : request-header:
    user-agent => Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en)
    AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3
    D, [2008-02-03T18:21:46.333360 #15602] DEBUG -- : request-header:
    accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
    D, [2008-02-03T18:21:46.333431 #15602] DEBUG -- : request-header:
    keep-alive => 300
    D, [2008-02-03T18:21:46.361484 #15602] DEBUG -- : Read 0 bytes
    D, [2008-02-03T18:21:46.362406 #15602] DEBUG -- : Read 948 bytes
    D, [2008-02-03T18:21:46.365163 #15602] DEBUG -- : Read 1141 bytes
    D, [2008-02-03T18:21:46.365336 #15602] DEBUG -- : response-header:
    last-modified => Sat, 17 Feb 2007 23:40:30 GMT
    D, [2008-02-03T18:21:46.365410 #15602] DEBUG -- : response-header:
    connection => Keep-Alive
    D, [2008-02-03T18:21:46.365481 #15602] DEBUG -- : response-header:
    content-type => text/plain
    D, [2008-02-03T18:21:46.368645 #15602] DEBUG -- : response-header: etag
    => "4688-475-9e16f780"
    D, [2008-02-03T18:21:46.368781 #15602] DEBUG -- : response-header: date
    => Sun, 03 Feb 2008 23:25:21 GMT
    D, [2008-02-03T18:21:46.368855 #15602] DEBUG -- : response-header:
    server => Apache-AdvancedExtranetServer
    D, [2008-02-03T18:21:46.368927 #15602] DEBUG -- : response-header:
    content-length => 1141
    D, [2008-02-03T18:21:46.368998 #15602] DEBUG -- : response-header:
    keep-alive => timeout=15, max=100
    D, [2008-02-03T18:21:46.369070 #15602] DEBUG -- : response-header: age
    => 1
    D, [2008-02-03T18:21:46.369141 #15602] DEBUG -- : response-header:
    accept-ranges => bytes
    I, [2008-02-03T18:21:46.369512 #15602] INFO -- : status: 200
     
    Carl Bernardi, Feb 4, 2008
    #3
  4. You have to set the pluggable parser for text/plain to the HTML
    parser. Here's a hack (you'll have to change get to get_html):

    class WWW::Mechanize
      def get_html(url)
        old_parser = @pluggable_parser['text/plain']
        @pluggable_parser['text/plain'] = @pluggable_parser['text/html']
        begin
          get(url)
        ensure
          @pluggable_parser['text/plain'] = old_parser
        end
      end
    end

    And the other way around (always get the raw body, even for HTML):

    class WWW::Mechanize
      def get_file(url)
        old_parser = @pluggable_parser['text/html']
        @pluggable_parser['text/html'] = ::WWW::Mechanize::File
        begin
          get(url).body
        ensure
          @pluggable_parser['text/html'] = old_parser
        end
      end
    end
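    The save/swap/restore pattern both hacks rely on can be sketched
    without Mechanize or a network connection. In the plain-Ruby sketch
    below, a Hash stands in for Mechanize's @pluggable_parser registry;
    FakeAgent, fetch_as_html, and the parser symbols are made-up names
    for illustration only:

    ```ruby
    # Sketch of the save/swap/restore pattern from the hacks above.
    # A plain Hash stands in for Mechanize's @pluggable_parser registry.
    class FakeAgent
      attr_reader :pluggable_parser

      def initialize
        # Pretend registry: content type => parser (symbols here,
        # classes in real Mechanize).
        @pluggable_parser = { 'text/plain' => :file_parser,
                              'text/html'  => :page_parser }
      end

      def get(url)
        # Stand-in for a real HTTP GET: report which parser would be
        # chosen for a text/plain response.
        @pluggable_parser['text/plain']
      end

      def fetch_as_html(url)
        old_parser = @pluggable_parser['text/plain']
        @pluggable_parser['text/plain'] = @pluggable_parser['text/html']
        begin
          get(url)
        ensure
          # Restored even if get raises.
          @pluggable_parser['text/plain'] = old_parser
        end
      end
    end

    agent = FakeAgent.new
    agent.fetch_as_html('http://example.com/menus.htm')  # => :page_parser
    agent.pluggable_parser['text/plain']                 # => :file_parser
    ```

    Wrapping the restore in ensure guarantees the registry is put back
    even when the request raises, which a bare reassignment after the
    call does not.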
     
    Marcin Raczkowski, Feb 4, 2008
    #4

  5. > Since Mechanize is used to parse forms and html, that makes sense: if
    > you don't have an html page(i.e. one with a Content-Type = text/*html*),
    > then you can't parse it as html.

    yes you can - use pluggable parsers


    >
    >> The content type for this page is "text/plain", which I think is
    >> part of the problem.

    >
    > A page with a content type of 'text/plain' is telling you that the page
    > is not html. Are you saying that the page is actually html even though
    > the page says that it does not contain html?

    Some web servers are misconfigured, or their admins simply don't
    care, and serve everything as text/plain.
     
    Marcin Raczkowski, Feb 4, 2008
    #5
  6. Carl Bernardi wrote:
    > The page is html. Below, I included the log. It shows the page's
    > content type to be "text/html" for the first few attempts and then the
    > last attempt to be "text/plain".


    I'm not sure how showing me the log files is evidence that, even
    though the server says the page is 'text/plain', it really contains
    HTML.


    > All I need to know is how to get a
    > Page instead of a File, either by extending Mechanize, creating an
    > instance of WWW::Mechanize::Page with the body from the File object


    Page#new() takes a URI as an argument. So it seems like you could save
    the file, and then provide a URI with the file:// scheme and create a
    new Page.
     
    7stud --, Feb 4, 2008
    #6
  7. Marcin Raczkowski wrote:
    >> Since Mechanize is used to parse forms and html, that makes sense: if
    >> you don't have an html page(i.e. one with a Content-Type = text/*html*),
    >> then you can't parse it as html.

    > yes you can - use pluggable parsers
    >


    Explain how you would parse plain text such as:

    Hi,

    My name is Sally.

    Yours Truly,
    Sally

    as HTML? What's the <title>? Which part is a <form>?
     
    7stud --, Feb 4, 2008
    #7
  8. 7stud -- wrote:
    > Marcin Raczkowski wrote:
    >>> Since Mechanize is used to parse forms and html, that makes sense: if
    >>> you don't have an html page(i.e. one with a Content-Type = text/*html*),
    >>> then you can't parse it as html.

    >> yes you can - use pluggable parsers
    >>

    >
    > Explain how you would parse plain text such as:
    >
    > Hi,
    >
    > My name is Sally.
    >
    > Yours Truly,
    > Sally
    >
    > as HTML? What's the <title>? Which part is a <form>?

    Do you know what MIME types are? Servers are required by the HTTP
    specification to provide a content type, and content should be
    interpreted according to it: if it's text/plain it should just be
    displayed; if it's application/zip it should be saved, and so on.

    BUT since many servers don't implement this fully - or have to be
    configured, or a PHP or CGI script (or Ruby, for that matter) might
    alter it, and sometimes does - HTML can be served with the MIME type
    text/plain.

    Since Mechanize follows that standard, it assumes that data with the
    MIME type text/plain is in fact plain text, just like the example you
    provided. But what if it's a web page, which Carl clearly explained in
    the initial post? Then you have to force Mechanize to treat it as
    HTML. Clear enough?
     
    Marcin Raczkowski, Feb 4, 2008
    #8
  9. 7stud -- wrote:
    > Carl Bernardi wrote:
    >> The page is html. Below, I included the log. It shows the page's
    >> content type to be "text/html" for the first few attempts and then the
    >> last attempt to be "text/plain".

    >
    > I'm not sure how showing me the log files is evidence that even though
    > the page says it is 'text/plain' that it really contains html.
    >
    >
    >> All I need to know is how to get a
    >> Page instead of a File, either by extending Mechanize, creating an
    >> instance of WWW::Mechanize::Page with the body from the File object

    >
    > Page#new() takes a URI as an argument. So it seems like you could save
    > the file, and then provide a URI with the file:// scheme and create a
    > new Page.


    Or, instead of that roundabout hack, you could use pluggable
    parsers - a feature already built into Mechanize - to force it to
    treat text/plain like HTML with a simple one-liner.

    Or use the more complex solution that I posted, which forces the
    pluggable parser only when you explicitly request it via get_html,
    and which cleans up after itself afterwards.
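    For reference, the one-liner would look something like this with the
    0.x-era WWW::Mechanize API. This is a sketch, not verified against
    any particular gem version, and the URL is hypothetical:

    ```ruby
    require 'rubygems'
    require 'mechanize'  # the pre-1.0 gem exposes the WWW::Mechanize namespace

    agent = WWW::Mechanize.new

    # Treat text/plain responses as HTML pages for every request this
    # agent makes, so get returns a WWW::Mechanize::Page even when the
    # server mislabels the page as text/plain.
    agent.pluggable_parser['text/plain'] = WWW::Mechanize::Page

    page = agent.get('http://www.example.com/menus.htm')
    page.links.each { |link| puts link.href }
    ```

    Unlike the get_html hack, this changes the mapping for the lifetime
    of the agent rather than per request.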
     
    Marcin Raczkowski, Feb 4, 2008
    #9
