download a web site... or a page of it...


SpreadTooThin

I want to download a website and store it in an html file.
I don't want to recurse into all the links that might be on the site, but I
do want all the images and formatting that are displayed just as when
you visit the site.

I think it should be as simple as:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

URL url = new URL("http://example.com");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();

BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));

String response = "";
String line;
while ((line = rd.readLine()) != null)
{
    response += line;
}
rd.close();


Will this work or is there more to be done?
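(For reference, the snippet above never actually writes the HTML to disk; a
minimal sketch of that last step, where the file name page.html is just an
example:)

import java.io.FileWriter;
import java.io.Writer;

// readLine() drops the line terminators, so append '\n' inside the read
// loop if you care about keeping the original line breaks
Writer out = new FileWriter("page.html"); // file name is arbitrary
try {
    out.write(response);
} finally {
    out.close();
}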
 

Lothar Kimmeringer

SpreadTooThin said:
I want to download a website and store it in an html file.
I don't want to recurse into all the links that might be on the site, but I
do want all the images and formatting that are displayed just as when
you visit the site.
[code snipped]
Will this work or is there more to be done?

There is more to do, because this will only give you the
HTML page itself, which these days is just a single part of
what gets displayed. In addition to that you often need CSS
files, JavaScript files etc.

If you're able to read C sources you might check out
http://www.httrack.com/proxytrack/
which is open source and allows you to have a look into the
internals of such a program. If you really just want to copy
a web page and you don't need to write an implementation yourself,
just select File->Save Page As... if you're using Firefox. This
will do exactly what you want. Or use the program linked above,
which can do that for one page or a complete website.


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 

Arne Vajhøj

SpreadTooThin said:
I want to download a website and store it in an html file.
I don't want to recurse into all the links that might be on the site, but I
do want all the images and formatting that are displayed just as when
you visit the site.
[code snipped]
Will this work or is there more to be done?

You will need to parse the returned HTML, possibly using regex,
and retrieve the images as well (note that those are binary and
should not be read with a BufferedReader).

If the site requires a login etc., then you will find Jakarta HttpClient
easier to use than (Http)URLConnection.
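(A hedged sketch of the login case with Jakarta Commons HttpClient 3.x; the
URL and credentials here are placeholders:)

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();
// register credentials for HTTP authentication (basic/digest)
client.getState().setCredentials(
        AuthScope.ANY,
        new UsernamePasswordCredentials("user", "secret"));

GetMethod get = new GetMethod("http://example.com/protected/index.html");
try {
    int status = client.executeMethod(get);
    if (status == 200) {
        String html = get.getResponseBodyAsString();
        // ... process html ...
    }
} finally {
    get.releaseConnection();
}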

Arne

PS: If you download a really huge HTML file you should use a
StringBuilder instead of appending to a String.
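(A small sketch of both points, reusing conn from the snippet above: a
variant of the read loop that accumulates into a StringBuilder, plus a raw
byte copy for an image; the image URL is made up:)

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URL;

// 1) text: accumulate the page in a StringBuilder instead of String +=
StringBuilder response = new StringBuilder();
BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while ((line = rd.readLine()) != null) {
    response.append(line).append('\n');
}
rd.close();

// 2) binary: copy an image byte for byte, never through a Reader
URL imageUrl = new URL("http://example.com/images/logo.png"); // example URL
InputStream in = imageUrl.openStream();
OutputStream out = new FileOutputStream("logo.png");
byte[] buffer = new byte[8192];
int n;
while ((n = in.read(buffer)) != -1) {
    out.write(buffer, 0, n);
}
in.close();
out.close();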
 

SpreadTooThin

Arne Vajhøj said:
You will need to parse the returned HTML, possibly using regex,
and retrieve the images as well (note that those are binary and
should not be read with a BufferedReader).
[snip]

Can I treat the returned HTML like an XML document?
 

Tom Anderson

Unless it is actually XHTML, no.

My suggestion would be to use HtmlUnit, which will turn HTML, even bad
HTML, into something sufficiently XML-like (i.e. a DOM tree) that it
becomes tractable:

http://htmlunit.sourceforge.net/

That will download and parse the HTML, and you can then query it to find
all the image elements and download their sources. Like this:

WebClient client = new WebClient(BrowserVersion.FIREFOX_3);
HtmlPage page = client.getPage("http://example.com");
for (Object obj : page.getByXPath("//img")) {
    HtmlImage img = (HtmlImage) obj;
    // there are various ways you could save the image data - here's a simple but inefficient one:
    byte[] imgData = img.getWebResponse().getContentAsBytes();
    OutputStream out = new FileOutputStream(img.getSrcAttribute()); // this might not be a good idea
    out.write(imgData);
    out.close();
}

You can also find all the links to other pages using similar logic, which
simplifies spidering.
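(A sketch of that link-gathering step, continuing from Tom's snippet, so
page and the HtmlUnit imports are assumed to be in scope:)

// collect the targets of all <a href="..."> elements on the page
for (HtmlAnchor anchor : page.getAnchors()) {
    String href = anchor.getHrefAttribute();
    if (href.length() > 0 && !href.startsWith("#")) {
        URL target = page.getFullyQualifiedUrl(href); // resolves relative links
        System.out.println(target);
    }
}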

I would definitely not try to do this using regexps or a handmade parser -
that would be reinventing the wheel, and there are too many corner cases
for it to be an easy job.

tom
 

Roedy Green

Can I treat the returned HTML like an XML document?

Typical HTML is crawling with syntax errors. IE's forgiveness of
errors has encouraged ever more sloppiness.

See http://mindprod.com/jgloss/screenscraping.html
for some tips on how to handle it, including TagSoup.
--
Roedy Green Canadian Mind Products
http://mindprod.com

One path leads to despair and utter hopelessness. The other,
to total extinction. Let us pray we have the wisdom to choose correctly.
~ Woody Allen .
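(Roedy's pointer in code form, roughly: TagSoup exposes a SAX parser that
tolerates broken HTML, so running it through an identity transform yields a
W3C DOM you can then treat like XML. A sketch, assuming TagSoup is on the
classpath:)

import java.net.URL;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// TagSoup's parser implements org.xml.sax.XMLReader but accepts messy HTML
XMLReader parser = new org.ccil.cowan.tagsoup.Parser();

InputSource html = new InputSource(new URL("http://example.com").openStream());

// identity transform: SAX events from TagSoup in, a DOM document out
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new SAXSource(parser, html), result);

Document doc = (Document) result.getNode();
// doc can now be queried with the usual DOM / XPath machinery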
 
