How to extract links

suman.tedla

Hi,

I am trying to extract all the hyperlinks on a Google results page to a
file, using Java.

That is, when I search for, say, "JAVA" on Google, I need to capture all the
resulting links for "JAVA" in a file, using Java.

Can anyone help me with this, and tell me how to get started and how to do
the extraction?

This is very important to me.
Your help is highly appreciated.
Thanks in advance.
 
hilz

To start, look at the java.net package, specifically the URL and
URLConnection classes (URLConnection is an abstract class, not an interface).
These will help you connect to the URL and get the text of the page.
You will probably also need the java.util.regex package to find the links
in that page.
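
A minimal sketch of the approach described above, assuming the links appear as quoted href attributes inside anchor tags. The class name HrefRegexSketch, the pattern, and the example address are illustrative assumptions, not something from the original post. A regular expression is only a rough approximation of HTML parsing and will miss unusual markup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexSketch {

    // Assumed pattern: an <a ...> tag with a single- or double-quoted href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Downloads the text of a page through URLConnection. */
    public static String fetch(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        StringBuilder text = new StringBuilder();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    /** Returns every href value found in the given HTML, in document order. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default address; pass a real page as the first argument.
        String address = args.length > 0 ? args[0] : "http://example.com/";
        for (String link : extractLinks(fetch(address))) {
            System.out.println(link);
        }
    }
}
```

Keeping extractLinks separate from fetch means the pattern can be tried on a saved HTML string before any network access is involved.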

If you show your code, and have more specific questions, you will get better
answers.

Good luck.
hilz
 
Aleksander Strączek

The URL class from the java.net package doesn't work for me (403, Google's protection).
I suggest using HttpClient to get the results from Google,
then HtmlUnit to extract the links easily (http://htmlunit.sourceforge.net).

Here is a working sample
(to run it, see http://htmlunit.sourceforge.net/dependencies.html):

import java.io.FileWriter;
import java.io.PrintWriter;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Iterator;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkExtract {

    private static final String FILE_NAME = "results.txt";

    private static final String QUERY = "JAVA";

    public static void main(String[] args) {

        PrintWriter printWriter = null;
        try {
            WebClient wc = new WebClient();
            // Encode the query so that spaces and special characters are safe in the URL.
            URL url = new URL("http://www.google.com/search?q="
                    + URLEncoder.encode(QUERY, "UTF-8"));
            HtmlPage page = (HtmlPage) wc.getPage(url);
            printWriter = new PrintWriter(new FileWriter(FILE_NAME));
            // Write the href of every anchor on the page, one per line.
            List anchors = page.getAnchors();
            for (Iterator iter = anchors.iterator(); iter.hasNext();) {
                HtmlAnchor anchor = (HtmlAnchor) iter.next();
                if (isSkipLink(anchor)) {
                    continue;
                }
                printWriter.println(anchor.getHrefAttribute());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (printWriter != null) {
                printWriter.close();
            }
        }
    }

    /**
     * Decide if this link has to be processed.
     *
     * @param anchor
     *            link
     * @return true if the link has to be omitted, false if it is to be processed
     */
    private static boolean isSkipLink(HtmlAnchor anchor) {
        String href = anchor.getHrefAttribute();
        // Skip site-relative links and links to cached copies.
        return href.startsWith("/")
                || href.indexOf("/search?q=cache:") > 0;
    }
}
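
The skip rule in isSkipLink can be tried without a network connection by applying the same two tests to plain href strings. The sketch below is an illustration only: the class name SkipRuleSketch, its method names, and the sample hrefs are made up, and it says nothing about what a real results page contains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipRuleSketch {

    /** Same rule as isSkipLink above, applied to a plain href string. */
    public static boolean isSkipHref(String href) {
        return href.startsWith("/")                        // site-relative link
                || href.indexOf("/search?q=cache:") > 0;   // link to a cached copy
    }

    /** Keeps only the hrefs that the rule does not skip, preserving order. */
    public static List<String> keepExternal(List<String> hrefs) {
        List<String> kept = new ArrayList<String>();
        for (String href : hrefs) {
            if (!isSkipHref(href)) {
                kept.add(href);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample hrefs, for illustration only.
        List<String> sample = Arrays.asList(
                "/preferences?hl=en",
                "http://www.example.com/java",
                "http://cache.example.net/search?q=cache:abc",
                "http://www.example.org/");
        for (String href : keepExternal(sample)) {
            System.out.println(href);
        }
    }
}
```

With the sample list, only the two absolute links that do not point at a cached copy are printed.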
 
Andrey Kuznetsov

Do you know how to skip "sponsored" links?
 
Aleksander Strączek

Andrey said:
do you know how to skip "sponsored" links?

DISCLAIMER
The intention of my previous post was to give hints on how to work with HTML
content in a Java application. The Google service was used only as an example.
If anyone wants to use these hints with Google services in any type of
application, read carefully
http://www.google.pl/intl/pl/terms_of_service.html
especially the sections
"Personal Use Only"
and
"No Automated Querying".
 
Joined
Mar 11, 2011
I got an error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/client/CredentialsProvider
    at LinkExtract.main(LinkExtract.java:22)
Caused by: java.lang.ClassNotFoundException: org.apache.http.client.CredentialsProvider
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)

How can I fix it? Please help.
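
A NoClassDefFoundError at run time generally means that a class the code needs is not on the run-time classpath. The class named in the trace, org.apache.http.client.CredentialsProvider, belongs to Apache HttpClient, so a likely cause is that the HtmlUnit jar was added without the other jars from the HtmlUnit dependency list. A minimal sketch for checking this follows; ClasspathCheckSketch is a hypothetical helper, not part of HtmlUnit.

```java
public class ClasspathCheckSketch {

    /** Returns true if the named class can be loaded from the current classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found, but something it depends on could not be loaded.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "com.gargoylesoftware.htmlunit.WebClient",
            "org.apache.http.client.CredentialsProvider"
        };
        for (String name : names) {
            System.out.println(name + (isOnClasspath(name) ? " found" : " MISSING"));
        }
    }
}
```

Run it with the same classpath that is used for LinkExtract. Any class reported as MISSING points to a jar that still has to be added.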
 
