The article said:
To start, look at the java.net package, specifically the URL class and the
URLConnection class. These will help you connect to the URL and get the
text of the page.
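A minimal sketch of that approach (the target URL here is just a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class PageFetch {
    public static void main(String[] args) throws Exception {
        // Open a connection to the page and read its body line by line.
        URLConnection conn = new URL("http://www.example.com/").openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
        in.close();
        System.out.println(html);
    }
}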
You will probably also need the java.util.regex package to find the links
in that page.
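For instance, a rough sketch that pulls href values out of the fetched HTML
with a regex (the pattern is deliberately simplistic; real HTML often needs
more care):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkRegex {
    public static void main(String[] args) {
        // Sample HTML fragment; in practice this would be the downloaded page.
        String html = "<a href=\"http://example.com\">example</a> <a href=\"/local\">local</a>";
        Pattern p = Pattern.compile("href=\"(.*?)\"");
        Matcher m = p.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // prints each captured link target
        }
    }
}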
If you show your code and ask more specific questions, you will get better
answers.
The URL class from the java.net package doesn't work for me (Google's bot protection returns a 403).
I suggest using HttpClient to get the results from Google,
then HtmlUnit to extract the links easily
(http://htmlunit.sourceforge.net).
Here is a working sample (for the jars you need on the classpath, see
http://htmlunit.sourceforge.net/dependencies.html):
import java.io.FileWriter;
import java.io.PrintWriter;
import java.net.URL;
import java.util.Iterator;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class LinkExtract {

    private static final String FILE_NAME = "results.txt";
    private static final String QUERY = "JAVA";

    public static void main(String[] args) {
        PrintWriter printWriter = null;
        try {
            // HtmlUnit's WebClient behaves like a browser, so Google serves the page.
            WebClient wc = new WebClient();
            URL url = new URL("http://www.google.com/search?q=" + QUERY);
            HtmlPage page = (HtmlPage) wc.getPage(url);
            printWriter = new PrintWriter(new FileWriter(FILE_NAME));
            List anchors = page.getAnchors();
            for (Iterator iter = anchors.iterator(); iter.hasNext();) {
                HtmlAnchor anchor = (HtmlAnchor) iter.next();
                // Skip Google's own navigation and cache links.
                if (isSkipLink(anchor)) {
                    continue;
                }
                printWriter.println(anchor.getHrefAttribute());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (printWriter != null) {
                printWriter.close();
            }
        }
    }

    /**
     * Decide if this link has to be processed.
     *
     * @param anchor link
     * @return true if the link is to be omitted, false if it is to be processed
     */
    private static boolean isSkipLink(HtmlAnchor anchor) {
        return anchor.getHrefAttribute().startsWith("/")
                || anchor.getHrefAttribute().indexOf("/search?q=cache:") > 0;
    }
}
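With the HtmlUnit jars from the dependencies page on the classpath, running
the class writes one result link per line to results.txt. The isSkipLink
filter drops Google's own relative links (those starting with "/") and the
"/search?q=cache:" cached-copy links, so only the actual result URLs remain.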