Tomcat 5.5 and web scraping

Discussion in 'Java' started by Fran Cottone, Mar 22, 2005.

  1. Fran Cottone

    Fran Cottone Guest

    I've had success "web scraping" ASP pages using the URLConnection object,
    but have had no luck with servlets. When I invoke the servlet from a
    browser there's no problem, but when I invoke it from code nothing
    happens. As I said, I have no problems opening connections to ASP pages,
    sending them parameters and reading back the output.

    Does Tomcat 5.5 treat web clients differently to IIS? I'd have thought
    that there'd be no difference, i.e. the web server shouldn't care who
    is talking to it as long as the client obeys the protocol.

    Many thanks in advance.
    Fran Cottone, Mar 22, 2005

  2. Fran Cottone

    Roland Guest

    I don't think Tomcat treats web clients any differently from other servers.
    However, a web application (hosted by Tomcat) may have specific requirements
    to function properly, for instance the HTTP POST method instead of
    HTTP GET when submitting a form. AFAIK plain URLConnection does not give
    you much control over how a page is retrieved.

    You can use URLConnection's subclass HttpURLConnection to specify the
    request method (GET, POST, and other HTTP methods) and set additional
    request headers.
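
    As a rough illustration of the HttpURLConnection suggestion above, here
    is a minimal sketch of a form POST. The class name ServletPost, the
    User-Agent value and the parameter names are made up for the example,
    and it assumes the servlet expects a URL-encoded form body and replies
    in UTF-8; adjust to whatever the real servlet needs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class ServletPost {

    // Build an application/x-www-form-urlencoded body such as "a=1&b=x+y".
    static String encodeForm(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // POST the parameters and return the response body as text.
    static String post(String address, Map<String, String> params) throws IOException {
        byte[] body = encodeForm(params).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // must be set before writing a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setRequestProperty("User-Agent", "ExampleScraper/0.1"); // hypothetical value
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        int status = conn.getResponseCode();
        // 4xx/5xx bodies are exposed on the error stream, which may be null.
        InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (in == null) {
            conn.disconnect();
            return "";
        }
        try (InputStream src = in) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = src.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            // Assumption: the servlet answers in UTF-8.
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java ServletPost <servlet-url>");
            return;
        }
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "value"); // hypothetical form field
        System.out.println(post(args[0], params));
    }
}
```

    If the same servlet works with GET from a browser, swapping the method
    and moving the encoded parameters into the query string is the obvious
    variation to try.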

    Further, Apache's Commons Net and Commons HTTP Client libraries might
    provide you with much more functionality for scraping the web.


    Roland de Ruiter
    ___ ___
    /__/ w_/ /__/
    / \ /_/ / \
    Roland, Mar 22, 2005

  3. Fran Cottone

    Simon Shearn Guest

    Hello -

    A few other things that sometimes cause screenscrapers to fail where
    browsers succeed:
    - Does the servlet expect cookie handling behaviour by the client?
    - Does the servlet expect to receive a referer header?
    - Is there any kind of authentication in use?
    - Is the scraper using (or not using) the appropriate proxy?
    - Is the servlet doing some kind of redirect or returning one of the less
    common response codes, rather than the usual 200 code?
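
    Several of the points in this checklist can be approximated with plain
    HttpURLConnection. The sketch below is only an illustration under
    assumptions: the class and method names are invented, it assumes the
    session cookie is handed out by some entry page before the servlet is
    called, and it ignores cookie attributes such as Path and expiry.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

class BrowserLikeRequest {

    // Turn Set-Cookie values into one Cookie request header value,
    // keeping only the leading name=value pair of each cookie.
    static String cookieHeader(List<String> setCookieValues) {
        StringBuilder sb = new StringBuilder();
        if (setCookieValues == null) {
            return "";
        }
        for (String v : setCookieValues) {
            int semi = v.indexOf(';');
            String pair = (semi >= 0 ? v.substring(0, semi) : v).trim();
            if (pair.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(pair);
        }
        return sb.toString();
    }

    // Value for an "Authorization" header when HTTP Basic authentication is in use.
    static String basicAuth(String user, String password) {
        String raw = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    // Collect Set-Cookie values regardless of header name casing.
    static List<String> setCookies(HttpURLConnection conn) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : conn.getHeaderFields().entrySet()) {
            if ("Set-Cookie".equalsIgnoreCase(e.getKey())) {
                found.addAll(e.getValue());
            }
        }
        return found;
    }

    // Visit an entry page first, then call the servlet with the cookies
    // and a Referer header, roughly as a browser would. Returns the status code.
    static int fetchWithSession(String entryUrl, String servletUrl) throws IOException {
        HttpURLConnection first =
                (HttpURLConnection) URI.create(entryUrl).toURL().openConnection();
        first.getResponseCode(); // forces the request to be sent
        String cookies = cookieHeader(setCookies(first));
        first.disconnect();

        HttpURLConnection second =
                (HttpURLConnection) URI.create(servletUrl).toURL().openConnection();
        if (!cookies.isEmpty()) {
            second.setRequestProperty("Cookie", cookies);
        }
        second.setRequestProperty("Referer", entryUrl);
        int status = second.getResponseCode();
        second.disconnect();
        return status;
    }
}
```

    For the proxy question, URL.openConnection also has an overload taking
    a java.net.Proxy, and the basicAuth value above would be passed with
    setRequestProperty("Authorization", ...) if the application uses Basic
    authentication.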

    In these situations it is helpful to look at the text of the HTTP requests
    and responses, which (last time I checked) is difficult to do with
    URLConnection and HttpURLConnection. If you want fine control, then Apache
    HttpClient, or as a last resort writing your own HTTP client using sockets,
    may be necessary.
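
    Short of a full HTTP client, the response side at least can be
    inspected with HttpURLConnection by turning off automatic redirects and
    printing the status line and headers. This is a hedged sketch with
    invented names (ResponseDump, dump, formatHeaders); it does not show
    the request exactly as sent, which is the part that stays hard to see.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Map;

class ResponseDump {

    // True for any 3xx status, the class that includes redirects.
    static boolean is3xx(int status) {
        return status / 100 == 3;
    }

    // Render a header map as text. HttpURLConnection stores the status
    // line under a null key, so that entry is printed without a name.
    static String formatHeaders(Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            for (String value : e.getValue()) {
                if (e.getKey() == null) {
                    sb.append(value).append('\n');
                } else {
                    sb.append(e.getKey()).append(": ").append(value).append('\n');
                }
            }
        }
        return sb.toString();
    }

    // Fetch the URL without following redirects and describe the response.
    static String dump(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // show the 3xx itself, not its target
        int status = conn.getResponseCode();
        String text = formatHeaders(conn.getHeaderFields());
        if (is3xx(status)) {
            text += "(3xx response, Location: " + conn.getHeaderField("Location") + ")\n";
        }
        conn.disconnect();
        return text;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java ResponseDump <url>");
            return;
        }
        System.out.print(dump(args[0]));
    }
}
```

    Comparing this output for the ASP page and for the servlet should show
    quickly whether the servlet answers with a redirect, an error code or a
    Set-Cookie header that the scraper is ignoring.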


    Simon Shearn, Mar 24, 2005
