Open URI & web scraping. Part II

Discussion in 'Ruby' started by Jean Nibee, Nov 13, 2007.

  1. Jean Nibee

    Jean Nibee Guest

    Hi

    (Short form of a post I made yesterday that got no love; I suspect it's
    because I was long-winded.)

    In a nutshell: if I use open-uri (and Hpricot) to download a web page and
    'scrape' all the images to write them to my local disk, dynamic images
    always come out malformed (size 0), but static images are fine.

    An example would be: <img
    src="http://myserver:8080/Someservlet?name=blah&param=value&etc=etc">

    Whether I copy/paste this URL into another browser or use open-uri to
    "get" the image, I get this error:

    XML Parsing Error: no element found
    Location: http://myserver:8080/Someservlet?name=blah&param=value&etc=etc
    Line Number 1, Column 1:

    BUT, this image is displayed PERFECTLY in the html.

    How can I get this image to download? (I suspect it's the mime type
    being set on the server side but I am not 100% sure)

    ***
    OUTPUT
    ***
    [[URI information...]]
    Fetched document:
    http://myserver:8080/Someservlet?name=blah&param=value&etc=etc
    Content Type: application/voicexml+xml
    Charset:
    Content-Encoding:
    Last Modified:
    IMAGE INFO!!! ->
    Writing to file ::
    D:\sandbox\auto_attendant\archive_reports\trunk\dumps\1194882652_854.gif

    Thanks for your help.
    --
    Posted via http://www.ruby-forum.com/.
    Jean Nibee, Nov 13, 2007
    #1
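One way to narrow down a symptom like the one in post #1 is to inspect the bytes that come back instead of trusting the Content-Type header. The reported header (application/voicexml+xml) is not an image type, and an XML parse error at line 1, column 1 would be consistent with an empty response body, which would also explain the 0-byte files. The following is only a sketch under assumptions: it uses `URI.open` from open-uri (Ruby 2.5 or later; in 2007 this was plain `open`), and `sniff_image_type`, `save_image` and `fetch_report` are hypothetical helper names, not part of open-uri or Hpricot.

```ruby
require "open-uri"

# Hypothetical helper: guess the image type from the leading bytes of the
# body, ignoring the Content-Type header. Returns nil for empty or unknown data.
def sniff_image_type(body)
  bytes = body.to_s.b
  return :gif  if bytes.start_with?("GIF87a", "GIF89a")
  return :png  if bytes.start_with?("\x89PNG\r\n\x1A\n".b)
  return :jpeg if bytes.start_with?("\xFF\xD8\xFF".b)
  nil
end

# Hypothetical helper: write the body in binary mode and refuse to create
# a zero-byte file. Returns the number of bytes written.
def save_image(path, body)
  raise ArgumentError, "empty response body, nothing to save" if body.nil? || body.empty?
  File.binwrite(path, body)
end

# Fetch a URL and report what the server actually sent.
# Not called in this sketch because it needs network access.
def fetch_report(url)
  URI.open(url) do |io|
    body = io.read
    {
      content_type: io.content_type,      # header as reported by the server
      bytes:        body.bytesize,        # 0 here means the server sent nothing
      sniffed:      sniff_image_type(body),
      body:         body
    }
  end
end
```

If `fetch_report` shows `bytes: 0`, the problem is on the request side (the server is not returning the image for this request) rather than in how the file is written. Writing with `File.binwrite` also rules out newline translation on Windows as a cause of corrupted images.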

  2. Axel Etzold

    Axel Etzold Guest

    Dear Jean,

    Maybe you can use Ruby's Rio (http://rio.rubyforge.org/) to download
    an entire website. I'm thinking in particular of the examples
    given in
    http://rio.rubyforge.org/classes/RIO/Doc/INTRO.html under the
    headings

    "Creating a Rio that refers to a web page" and
    "Creating a Rio that refers to a file or directory on a FTP server".

    Otherwise, maybe you would get better responses on the Rails mailing list?

    Best regards,

    Axel







    Axel Etzold, Nov 13, 2007
    #2

  3. Jean Nibee

    Jean Nibee Guest

    Axel Etzold wrote:
    >
    > Dear Jean,
    >
    > maybe you can use ruby's rio (http://rio.rubyforge.org/) to download
    > an entire website. I'm thinking in particular of the examples
    > given in
    > http://rio.rubyforge.org/classes/RIO/Doc/INTRO.html under the
    > headers
    >
    > "Creating a Rio that refers to a web page" and
    > "Creating a Rio that refers to a file or directory on a FTP server".
    >
    > Otherwise, maybe you get better responses on the Rails mailing list ?
    >
    > Best regards,
    >
    > Axel


    Same issue with RIO (albeit it's a little more complex to get the page
    and parse it the way I'm doing with OpenURI / Hpricot).

    I didn't post to the Rails list since this isn't using the Rails
    framework, but maybe they do enough web work that they'd spot an issue
    I'm missing.

    Thanks for your reply and help!
    Jean Nibee, Nov 13, 2007
    #3
  4. Peter Szinek

    Peter Szinek Guest

    Hi Jean,

    > Same issue with RIO (albeit a little more complex to get the page and
    > parse it as I'm doing w/ OpenURI / Hpricot.)


    What does an aggressive wget (i.e. with grab everything options) do?

    Cheers,
    Peter
    ___
    http://www.rubyrailways.com
    http://scrubyt.org
    Peter Szinek, Nov 13, 2007
    #4
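A Ruby-side check in the same spirit as the wget question is to repeat the image request with the headers a browser would send when it loads the image from inside the page. This is a sketch resting on an unconfirmed assumption: that the servlet only returns the image when it sees a session cookie or a Referer from the page that embeds it. The thread does not establish this. `browser_like_headers` and `fetch_with_headers` are hypothetical helper names and every header value is a placeholder; open-uri treats string keys in its options hash as request header fields.

```ruby
require "open-uri"

# Hypothetical helper: build request headers resembling what a browser sends
# when it loads an image embedded in page_url. All values are placeholders.
def browser_like_headers(page_url, cookie = nil)
  headers = {
    "User-Agent" => "Mozilla/5.0 (compatible; scrape-test)",
    "Referer"    => page_url
  }
  # Assumption: the session cookie was captured from the earlier page request.
  headers["Cookie"] = cookie if cookie
  headers
end

# Fetch the image with those headers and return the raw body.
# Not called in this sketch because it needs network access.
def fetch_with_headers(image_url, page_url, cookie = nil)
  URI.open(image_url, browser_like_headers(page_url, cookie)) { |io| io.read }
end
```

If the body is non-empty with these headers but empty without them, the servlet depends on request context, and the scraper would need to carry the cookie or Referer over from the page fetch.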