JSP or httpservlet for Java spider?

Greg Peters

Hi. I want to spider just a few websites, and not each entire site, just 1
or 2 levels deep. Can I use JSP or HttpServlets for this? Does anyone know
of a tutorial, code sample, or book that explains this? I usually use JSP
and HttpServlets for processing requests, but here I want to get the data
from a different website.

Or do I have to spider using Perl, then store the results in a database and
retrieve them using JSP/HttpServlets? Thank you.
 
Roedy Green

Greg said:
Hi. I want to spider just a few websites, not the entire site, just 1 or 2
levels deep. So can I use JSP or httpservlets for this? Does anyone know of
some tutorial/code/book that explains this? I usually use JSP and
httpservlets for processing requests, but I want to get the data from a
different website.

See http://mindprod.com/applets/fileio.htm
for how to do GET.
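
As a rough illustration of the GET step, here is a minimal sketch using only java.net from the standard library. The class name PageFetcher, the 10-second timeouts, and the UTF-8 decoding are assumptions made for the example, not anything taken from the page linked above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: fetches one page with a plain HTTP GET. */
class PageFetcher {

    /** Reads the stream to its end and decodes it as UTF-8 (an assumption;
     *  a real spider should honour the charset the server declares). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Issues a GET for the given absolute URL and returns the response body. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // milliseconds, arbitrary choice
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }
}
```

A call such as PageFetcher.fetch("http://example.com/") would return the page source as a String; getInputStream() throws an IOException for error responses such as 404, so the caller has to decide what to do with those.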

Then you have to find the links to spider, e.g. with the pattern

<a href="xxxx"

You can crudely use indexOf( "<a href=" ), or you can use a regex if you
want to catch squirrelly stuff like extra spaces or parameters.

See http://mindprod.com/jgloss/regex.html
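
The regex route might look something like the sketch below. The class name LinkExtractor and the exact pattern are assumptions for illustration; the pattern tolerates extra whitespace, other attributes before href, upper-case tags, and either quote style, but it is not a full HTML parser and will miss unquoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls href values out of anchor tags with a regex. */
class LinkExtractor {

    // "<a", a word boundary, any other attributes, whitespace, then
    // href = "value" or href = 'value'. Group 1 is the quote, group 2 the value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    /** Returns the raw href values in document order, duplicates included. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2).trim());
        }
        return links;
    }
}
```

The values come back exactly as written in the page, so relative links still need to be resolved against the URL of the page they were found on before they go into the queue.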

You add the links to a queue of links to be spidered.
See http://mindprod.com/queue.html

Then you spawn up to N threads that each grab the next queue item and
spider it.

See http://mindprod.com/projects/htmlbrokenlink.html
for more details.
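
One possible shape for the queue-and-threads part, limited to a fixed depth, is sketched below. It is a variation rather than the exact scheme described above: instead of threads pulling from one shared queue, it processes the queue one depth level at a time on a fixed pool of N threads, which makes the depth limit and the stopping condition easy to get right. The names Spider, crawl, and linksOf are hypothetical; linksOf stands in for "fetch the page and extract its links".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical depth-limited spider driver. */
class Spider {

    /**
     * Visits the start page (depth 0) and every page reachable within
     * maxDepth links of it, each page once. Returns the visited URLs in order.
     * Links found on pages at maxDepth are not followed.
     */
    static List<String> crawl(String start, int maxDepth, int threads,
                              Function<String, List<String>> linksOf)
            throws InterruptedException, ExecutionException {
        Set<String> seen = new HashSet<>();      // everything ever queued
        List<String> visited = new ArrayList<>();
        List<String> frontier = new ArrayList<>(); // the queue for this level
        seen.add(start);
        frontier.add(start);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
                List<Callable<List<String>>> jobs = new ArrayList<>();
                for (String url : frontier) {
                    jobs.add(() -> linksOf.apply(url));
                }
                // invokeAll blocks until the whole level has been spidered.
                List<Future<List<String>>> results = pool.invokeAll(jobs);
                visited.addAll(frontier);
                List<String> next = new ArrayList<>();
                if (depth < maxDepth) {
                    for (Future<List<String>> result : results) {
                        for (String link : result.get()) {
                            if (seen.add(link)) { // true only for new links
                                next.add(link);
                            }
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return visited;
    }
}
```

For the "1 or 2 levels deep" case in the question, maxDepth would be 1 or 2. Only the thread that calls crawl touches the seen set, so no extra synchronization is needed in this arrangement.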
 
John C. Bollinger

Greg said:
Hi. I want to spider just a few websites, not the entire site, just 1 or 2
levels deep. So can I use JSP or httpservlets for this? Does anyone know of
some tutorial/code/book that explains this? I usually use JSP and
httpservlets for processing requests, but I want to get the data from a
different website.

Or do I have to spider using perl, then store it in a database and retrieve
it using JSP/httpservlets? Thank you.

JSP and servlets are mechanisms for generating dynamic responses to HTTP
requests. They are most often used for serving HTML pages. They have
no special mechanism beyond any other Java code for making
general-purpose HTTP requests or doing anything with the results of
such a request.

Even though JSP and servlets specifically would be inappropriate choices
for a web spider, that does not mean that Java in general is wrong for
the task. To the contrary, the Java platform library has good support
for a wide variety of network- and web-oriented tasks, and there is a
multitude of third-party libraries that build further on that foundation.
Look at the URL, URLConnection, and HttpURLConnection classes in the
java.net package to start, and perhaps at DOM (package org.w3c.dom) for
document analysis. You might also find the Jakarta HTTP Client library
(http://jakarta.apache.org/commons/httpclient/) useful. There are many
other resources available.
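
As one small example of what java.net offers for this job, the sketch below turns an href found on a page into an absolute URL. It uses java.net.URI rather than the URL class named above, because URI.resolve also normalizes "../" segments. The class name LinkResolver, the http/https filter, the fragment stripping, and the example.com addresses are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Optional;

/** Hypothetical helper: resolves an href against the page it was found on. */
class LinkResolver {

    /**
     * Returns the absolute http(s) URL for href, without any #fragment,
     * or empty if the link is malformed or uses another scheme (mailto: etc.).
     */
    static Optional<String> resolve(String pageUrl, String href) {
        try {
            URI absolute = URI.create(pageUrl).resolve(href.trim());
            String scheme = absolute.getScheme();
            if (scheme == null
                    || !(scheme.equalsIgnoreCase("http")
                         || scheme.equalsIgnoreCase("https"))) {
                return Optional.empty();
            }
            // Rebuild without the fragment so that page.html#top and
            // page.html count as the same page.
            return Optional.of(scheme + ":" + absolute.getRawSchemeSpecificPart());
        } catch (IllegalArgumentException e) {
            // URI.create rejects illegal characters such as unescaped spaces.
            return Optional.empty();
        }
    }
}
```

A spider that should stay on "just a few websites" could additionally compare the host of the resolved URI with the host of the start page and skip links that lead elsewhere.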

As for displaying pages previously retrieved by your spider, chances are
that a fairly simple servlet could handle the job admirably. There
might be reasons to do it with JSP / custom tags instead, but that
approach wouldn't be my first inclination.
 
