The article said:
To start, look at the java.net package, specifically the URL class and the
URLConnection class. These will help you connect to the URL and get the
text of the page.
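A minimal sketch of that approach (the target URL here is just a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class PageFetch {
    public static void main(String[] args) throws Exception {
        // Open a connection to the page and read its body line by line.
        URLConnection conn = new URL("http://www.example.com/").openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
        in.close();
        System.out.println(html);
    }
}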
You will probably also need the java.util.regex package to find the links
in that page.
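For instance, a rough sketch that pulls href values out of the fetched HTML
with a regex (the pattern is deliberately simplistic; real HTML often needs
more care):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkRegex {
    public static void main(String[] args) {
        // Sample HTML fragment; in practice this would be the downloaded page.
        String html = "<a href=\"http://example.com\">example</a> <a href=\"/local\">local</a>";
        Pattern p = Pattern.compile("href=\"(.*?)\"");
        Matcher m = p.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // prints each captured link target
        }
    }
}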
If you show your code and ask more specific questions, you will get better
answers.
The URL class from the java.net package doesn't work for me (Google's bot protection returns a 403).
I suggest using HttpClient to get the results from Google,
then HtmlUnit to extract the links easily
(http://htmlunit.sourceforge.net).
Here is a working sample (for the jars you need on the classpath, see
http://htmlunit.sourceforge.net/dependencies.html):
import java.io.FileWriter;
import java.io.PrintWriter;
import java.net.URL;
import java.util.Iterator;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkExtract {

    private static final String FILE_NAME = "results.txt";
    private static final String QUERY = "JAVA";

    public static void main(String[] args) {
        PrintWriter printWriter = null;
        try {
            // HtmlUnit's WebClient behaves like a browser, so Google serves the page.
            WebClient wc = new WebClient();
            URL url = new URL("http://www.google.com/search?q=" + QUERY);
            HtmlPage page = (HtmlPage) wc.getPage(url);
            printWriter = new PrintWriter(new FileWriter(FILE_NAME));
            List anchors = page.getAnchors();
            for (Iterator iter = anchors.iterator(); iter.hasNext();) {
                HtmlAnchor anchor = (HtmlAnchor) iter.next();
                // Skip Google's own navigation and cache links.
                if (isSkipLink(anchor)) {
                    continue;
                }
                printWriter.println(anchor.getHrefAttribute());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (printWriter != null) {
                printWriter.close();
            }
        }
    }

    /**
     * Decide if this link has to be processed.
     *
     * @param anchor link
     * @return true if the link is to be omitted, false if it is to be processed
     */
    private static boolean isSkipLink(HtmlAnchor anchor) {
        return anchor.getHrefAttribute().startsWith("/")
                || anchor.getHrefAttribute().indexOf("/search?q=cache:") > 0;
    }
}
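With the HtmlUnit jars from the dependencies page on the classpath, running
the class writes one result link per line to results.txt. The isSkipLink
filter drops Google's own relative links (those starting with "/") and the
"/search?q=cache:" cached-copy links, so only the actual result URLs remain.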