Tomcat 5.5 and web scraping

Fran Cottone

I've had success "web scraping" ASPs using the URLConnection object,
but have had no luck with servlets. When I invoke the servlet from a
browser there's no problem, but when I invoke it from code nothing
happens. As I said, I have no problems opening connections to ASPs,
sending them parameters and reading back the output.

Does Tomcat 5.5 treat web clients differently from IIS? I'd have thought
that there'd be no difference, i.e. the web server shouldn't care who
is talking to it as long as the client obeys the protocol.

Many thanks in advance.
 
Roland

I don't think Tomcat treats web clients any differently from other servers.
However, a web application (hosted by Tomcat) may have specific needs in
order to function properly, for instance the HTTP POST method instead of
HTTP GET when submitting a form. AFAIK plain URLConnection does not give
you much control over how a page is retrieved.

You can use URLConnection's subclass HttpURLConnection to specify the
request method (GET, POST, and other HTTP methods) and set additional
request headers.
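
For illustration, here is a minimal sketch of a form POST with
HttpURLConnection. It assumes the servlet expects an ordinary
application/x-www-form-urlencoded form submission, which may not be the
case for your servlet. The class name ServletPost, the parameter names
and the User-Agent value are hypothetical placeholders, and the code
needs Java 10 or later for URLEncoder.encode(String, Charset).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class ServletPost {

    // Encode parameters the way a browser encodes a submitted form.
    static String encodeForm(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Read a stream fully as UTF-8 text; a null stream yields an empty string.
    static String readAll(InputStream in) throws IOException {
        if (in == null) return "";
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString();
    }

    // POST the parameters and return the status code followed by the body.
    static String post(String url, Map<String, String> params) throws IOException {
        byte[] body = encodeForm(params).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // we are going to send a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        // Hypothetical value; some applications reject the default Java agent string.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; example-scraper)");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        int status = conn.getResponseCode();
        // For 4xx/5xx responses the body is on the error stream, not the input stream.
        InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        return status + "\n" + readAll(in);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java ServletPost <servlet-url>");
            return;
        }
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "value"); // hypothetical form field
        System.out.println(post(args[0], params));
    }
}
```

Printing the status code along with the body makes it easier to tell
"nothing happens" apart from an error page or an empty response.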

Further, Apache's Commons Net (<http://jakarta.apache.org/commons/net/>)
and Commons HTTP Client
(<http://jakarta.apache.org/commons/httpclient/>) libraries might
provide you with much more functionality for scraping the web.

--
Regards,

Roland de Ruiter
___ ___
/__/ w_/ /__/
/ \ /_/ / \
 
Simon Shearn

Hello -

A few other things that sometimes cause screenscrapers to fail where
browsers succeed:
- Does the servlet expect the client to handle cookies (e.g. a session cookie)?
- Does the servlet expect to receive a Referer header?
- Is there any kind of authentication in use?
- Is the scraper using (or not using) the appropriate proxy?
- Is the servlet doing some kind of redirect or returning one of the less
common response codes, rather than the usual 200 code?

In these situations it is helpful to look at the text of the HTTP requests
and responses, which (last time I checked) is difficult to do with
URLConnection and HttpURLConnection. If you want fine control, then Apache
HttpClient, or as a last resort writing your own HTTP client using sockets,
may be necessary.
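
As a rough starting point for checking some of the points above, the
sketch below uses HttpURLConnection with automatic redirect following
switched off, so that the real status code, the Location header and any
Set-Cookie headers become visible. It shows only the response side, not
the raw request text. The class name ResponseProbe and its helper
methods are hypothetical, and the cookie handling is deliberately naive
(it ignores path, domain and expiry).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ResponseProbe {

    // Build a Cookie request header from Set-Cookie values by keeping only
    // the leading name=value pair of each and dropping attributes such as Path.
    static String cookieHeader(List<String> setCookieValues) {
        StringBuilder sb = new StringBuilder();
        for (String v : setCookieValues) {
            int semi = v.indexOf(';');
            String pair = (semi >= 0 ? v.substring(0, semi) : v).trim();
            if (pair.isEmpty()) continue;
            if (sb.length() > 0) sb.append("; ");
            sb.append(pair);
        }
        return sb.toString();
    }

    // True for the status codes that redirect the client to another URL.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    // Send a GET, print the status line and headers, and return the cookies
    // the server asked us to store (as a ready-made Cookie header value).
    static String probe(String url, String referer, String cookies) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(false); // show redirects instead of hiding them
        if (referer != null) conn.setRequestProperty("Referer", referer);
        if (cookies != null && !cookies.isEmpty()) conn.setRequestProperty("Cookie", cookies);

        int status = conn.getResponseCode();
        System.out.println(status + " " + conn.getResponseMessage());

        List<String> setCookies = new ArrayList<>();
        for (Map.Entry<String, List<String>> h : conn.getHeaderFields().entrySet()) {
            if (h.getKey() == null) continue; // the status line is stored under a null key
            System.out.println(h.getKey() + ": " + h.getValue());
            if (h.getKey().equalsIgnoreCase("Set-Cookie")) setCookies.addAll(h.getValue());
        }
        if (isRedirect(status)) {
            System.out.println("Redirected to: " + conn.getHeaderField("Location"));
        }
        return cookieHeader(setCookies);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java ResponseProbe <url>");
            return;
        }
        String cookies = probe(args[0], null, null);
        // Second request replays the cookies, as a browser would.
        probe(args[0], args[0], cookies);
    }
}
```

If the second request behaves differently from the first, the servlet
probably depends on a session cookie or on the Referer header.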

Regards,

Simon
 
