download a web site... or a page of it...


SpreadTooThin

I want to download a website and store it in an html file.
I don't want to recurse into all the links that might be on the site, but I
do want all the images and formatting that are displayed just as when
you visit the site.

I think it should be as simple as:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

URL url = new URL("http://example.com");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();

BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));

String response = "";
String line;
while ((line = rd.readLine()) != null)
{
    response += line;
}
rd.close();


Will this work or is there more to be done?
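(For reference, the snippet above never actually writes the HTML to disk; a
minimal sketch of that last step, where the file name page.html is just an
example:)

import java.io.FileWriter;
import java.io.Writer;

// readLine() drops the line terminators, so append '\n' inside the read
// loop if you care about keeping the original line breaks
Writer out = new FileWriter("page.html"); // file name is arbitrary
try {
    out.write(response);
} finally {
    out.close();
}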
 

Lothar Kimmeringer

SpreadTooThin said:
I want to download a website and store it in an html file.
I don't want to recurse into all the links that might be on the site, but I
do want all the images and formatting that are displayed just as when
you visit the site.
[code snipped]
Will this work or is there more to be done?

There is more to do, because this will only give you the
HTML page itself, which these days is just a single part of
what gets displayed. In addition to that you often need CSS
files, JavaScript files etc.

If you're able to read C sources you might check out
http://www.httrack.com/proxytrack/
which is open source and allows you to have a look into the
internals of such a program. If you really just want to copy
a web page and you don't need to write an implementation yourself,
just select File->Save Page As... if you're using Firefox. This
will do exactly what you want. Or use the program linked above,
which can do that for one page or a complete website.


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 

Arne Vajhøj

SpreadTooThin said:
I want to download a website and store it in an html file.
I don't want to recurse into all the links that might be on the site, but I
do want all the images and formatting that are displayed just as when
you visit the site.
[code snipped]
Will this work or is there more to be done?

You will need to parse the returned HTML, possibly using regex,
and retrieve the images as well (note that those are binary and
should not be read with a BufferedReader).

If the site requires a login etc., then you will find Jakarta HttpClient
easier to use than (Http)URLConnection.
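(A hedged sketch of the login case with Jakarta Commons HttpClient 3.x; the
URL and credentials here are placeholders:)

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();
// register credentials for HTTP authentication (basic/digest)
client.getState().setCredentials(
        AuthScope.ANY,
        new UsernamePasswordCredentials("user", "secret"));

GetMethod get = new GetMethod("http://example.com/protected/index.html");
try {
    int status = client.executeMethod(get);
    if (status == 200) {
        String html = get.getResponseBodyAsString();
        // ... process html ...
    }
} finally {
    get.releaseConnection();
}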

Arne

PS: If you download a really huge HTML file you should use a
StringBuilder instead of appending to a String.
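(A small sketch of both points, reusing conn from the snippet above: a
variant of the read loop that accumulates into a StringBuilder, plus a raw
byte copy for an image; the image URL is made up:)

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URL;

// 1) text: accumulate the page in a StringBuilder instead of String +=
StringBuilder response = new StringBuilder();
BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while ((line = rd.readLine()) != null) {
    response.append(line).append('\n');
}
rd.close();

// 2) binary: copy an image byte for byte, never through a Reader
URL imageUrl = new URL("http://example.com/images/logo.png"); // example URL
InputStream in = imageUrl.openStream();
OutputStream out = new FileOutputStream("logo.png");
byte[] buffer = new byte[8192];
int n;
while ((n = in.read(buffer)) != -1) {
    out.write(buffer, 0, n);
}
in.close();
out.close();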
 

SpreadTooThin

Arne Vajhøj said:
You will need to parse the returned HTML, possibly using regex,
and retrieve the images as well (note that those are binary and
should not be read with a BufferedReader).
[snip]

Can I treat the returned HTML like an XML document?
 

Tom Anderson

Unless it is actually XHTML, no.

My suggestion would be to use HtmlUnit, which will turn HTML, even bad
HTML, into something sufficiently XML-like (i.e. a DOM tree) that it
becomes tractable:

http://htmlunit.sourceforge.net/

That will download and parse the HTML, and you can then query it to find
all the image elements and download their sources. Like this:

WebClient client = new WebClient(BrowserVersion.FIREFOX_3);
HtmlPage page = client.getPage("http://example.com");
for (Object obj : page.getByXPath("//img")) {
    HtmlImage img = (HtmlImage) obj;
    // there are various ways you could save the image data - here's a simple but inefficient one:
    byte[] imgData = img.getWebResponse().getContentAsBytes();
    OutputStream out = new FileOutputStream(img.getSrcAttribute()); // this might not be a good idea
    out.write(imgData);
    out.close();
}

You can also find all the links to other pages using similar logic, which
simplifies spidering.
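(A sketch of that link-gathering step, continuing from Tom's snippet, so
page and the HtmlUnit imports are assumed to be in scope:)

// collect the targets of all <a href="..."> elements on the page
for (HtmlAnchor anchor : page.getAnchors()) {
    String href = anchor.getHrefAttribute();
    if (href.length() > 0 && !href.startsWith("#")) {
        URL target = page.getFullyQualifiedUrl(href); // resolves relative links
        System.out.println(target);
    }
}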

I would definitely not try to do this using regexps or a handmade parser -
that would be reinventing the wheel, and there are too many corner cases
for it to be an easy job.

tom
 

Roedy Green

Can I treat the returned HTML like an XML document?

Typical HTML is crawling with syntax errors. IE's forgiveness of
errors has encouraged ever more sloppiness.

See http://mindprod.com/jgloss/screenscraping.html
for some tips on how to handle it, including TagSoup.
--
Roedy Green Canadian Mind Products
http://mindprod.com

One path leads to despair and utter hopelessness. The other,
to total extinction. Let us pray we have the wisdom to choose correctly.
~ Woody Allen .
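(Roedy's pointer in code form, roughly: TagSoup exposes a SAX parser that
tolerates broken HTML, so running it through an identity transform yields a
W3C DOM you can then treat like XML. A sketch, assuming TagSoup is on the
classpath:)

import java.net.URL;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// TagSoup's parser implements org.xml.sax.XMLReader but accepts messy HTML
XMLReader parser = new org.ccil.cowan.tagsoup.Parser();

InputSource html = new InputSource(new URL("http://example.com").openStream());

// identity transform: SAX events from TagSoup in, a DOM document out
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new SAXSource(parser, html), result);

Document doc = (Document) result.getNode();
// doc can now be queried with the usual DOM / XPath machinery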
 
