HTTP connection doesn't work on digg?

Russell Glasser

I'm trying to familiarize myself with the method of connecting to web
sites with Java. I've written a simple program to connect to a page
at a given URL, but I've noticed it behaves differently for different
sites.

Here's some code which I have stripped of most of the extra stuff just
to highlight the problem:

----

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public void connTest(String addr) {
    try {
        System.out.println("Trying to connect to " + addr);
        URL u = new URL(addr);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        InputStream is = conn.getInputStream();
        System.out.println("Input stream is open...");
        is.close();
        conn.disconnect();
    } catch (Exception e) {
        System.out.println("Something's wrong");
    }
}

----

Then to invoke it, I try:

connTest("http://www.google.com");
connTest("http://www.digg.com");

Here's the output:

Trying to connect to http://www.google.com
Input stream is open...
Trying to connect to http://www.digg.com


The first method call takes a few seconds, but then gives me what I
asked for (and then I can go ahead and print out all the html with a
reader). The second method call just hangs. As soon as it hits the
line "InputStream is = conn.getInputStream();" it's stuck. The same
thing happens if I try to get any other property, such as
getResponseCode.

I've tried this with several web sites and Digg is the only widely-used
site that gives me this problem. But I can open it in a browser just
fine. Am I doing something wrong?
 
Gordon Beaton

Russell said:
I'm trying to familiarize myself with the method of connecting to
web sites with Java. I've written a simple program to connect to a
page at a given URL, but I've noticed it behaves differently for
different sites.

To debug protocol issues like this, use a tool like Wireshark to see
exactly what the browser is doing and compare it to what your program
does.

I think you need to set a User-Agent in the request before connecting:

conn.setRequestProperty("User-Agent", "something useful");

You can use Google to find lists of valid User-Agent strings for
various browsers and OS platforms, or cut and paste one from the dump
you get from Wireshark when running your regular browser.
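
Untested, but the full call sequence would look something like this (the
agent string and contact address are only placeholders, use whatever
accurately identifies your program):

----

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public void connTest(String addr) {
    try {
        System.out.println("Trying to connect to " + addr);
        URL u = new URL(addr);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        // Identify the client before connect(); some servers stall or drop
        // requests that arrive without a recognisable User-Agent header.
        conn.setRequestProperty("User-Agent",
                "MyTestClient/0.1 (contact: you@example.org)");
        conn.connect();
        System.out.println("Response code: " + conn.getResponseCode());
        InputStream is = conn.getInputStream();
        System.out.println("Input stream is open...");
        is.close();
        conn.disconnect();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

----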

/gordon

--
 
Ian Wilson

Russell said:
I'm trying to familiarize myself with the method of connecting to web
sites with Java. I've written a simple program to connect to a page
at a given URL, but I've noticed it behaves differently for different
sites.

Here's some code which I have stripped of most of the extra stuff just
to highlight the problem:

----

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public void connTest(String addr) {
    try {
        System.out.println("Trying to connect to " + addr);
        URL u = new URL(addr);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        conn.connect();
        InputStream is = conn.getInputStream();
        System.out.println("Input stream is open...");
        is.close();
        conn.disconnect();
    } catch (Exception e) {
        System.out.println("Something's wrong");
    }
}

----

Then to invoke it, I try:

connTest("http://www.google.com");
connTest("http://www.digg.com");

Here's the output:

Trying to connect to http://www.google.com
Input stream is open...
Trying to connect to http://www.digg.com


The first method call takes a few seconds, but then gives me what I
asked for (and then I can go ahead and print out all the html with a
reader). The second method call just hangs. As soon as it hits the
line "InputStream is = conn.getInputStream();" it's stuck. The same
thing happens if I try to get any other property, such as
getResponseCode.

I've tried this with several web sites and Digg is the only widely-used
site that gives me this problem. But I can open it in a browser just
fine. Am I doing something wrong?

1) You may not be waiting long enough. It can take minutes for DNS
resolution to give up or for connection attempts to fail. Try
connTest("http://imaginary.example.com") and see what result you get and
how long it takes.

2) Your exception handling discards all the useful information in the
exception. I'd at least print e.getMessage() or a stack trace (see the
sketch after these points).

3) Popular free services (like Google) often take measures to prevent
use of their normal HTTP service by anything other than a human clicking
around in a web browser. Sometimes they have an API and a registration
process for software authors. Maybe Digg is even more intolerant than
Google of what they perceive as inappropriate use?
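
For points 1 and 2, something along these lines (a rough sketch, with
arbitrary ten-second timeouts; setConnectTimeout and setReadTimeout need
Java 5 or later) would at least tell you quickly whether you are dealing
with a slow connection or a genuine hang, and keep the diagnostics if it
fails:

----

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public void connTest(String addr) {
    try {
        System.out.println("Trying to connect to " + addr);
        URL u = new URL(addr);
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        // Give up quickly instead of waiting minutes (values in milliseconds).
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        conn.connect();
        InputStream is = conn.getInputStream();
        System.out.println("Input stream is open...");
        is.close();
        conn.disconnect();
    } catch (Exception e) {
        // Keep the diagnostic information instead of discarding it.
        e.printStackTrace();
    }
}

----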
 
Russell Glasser

Gordon said:
To debug protocol issues like this, use a tool like Wireshark to see
exactly what the browser is doing and compare it to what your program
does.

I think you need to set a User-Agent in the request before connecting:

conn.setRequestProperty("User-Agent", "something useful");

You can use Google to find lists of valid User-Agent strings for
various browsers and OS platforms, or cut and paste one from the dump
you get from Wireshark when running your regular browser.


Thanks Gordon, your suggestion was right on. I added a User-Agent
string and it works now.

Russell
 
Russell Glasser

Ian said:
1) You may not be waiting long enough. It can take minutes for DNS
resolution to give up or for connection attempts to fail. Try
connTest("http://imaginary.example.com") and see what result you get and
how long it takes.

That's interesting. My problem is solved now so it doesn't matter,
but how do IE and Firefox handle this delay? When I type an invalid
address into a browser, I get an error very quickly.
2) Your exception handling discards all the useful information in the
exception. I'd at least print e.getMessage() or a stack trace.

Yes, I know that. I handle different error types in my real program,
but I wanted to post something bare-bones online.

Anyway, the code wasn't reaching the "catch" block -- it was just
stopping dead. So having a descriptive message wouldn't have helped.
3) Popular free services (like Google) often take measures to prevent
use of their normal HTTP service by anything other than a human clicking
around in a web browser. Sometimes they have an API and a registration
process for software authors. Maybe Digg is even more intolerant than
Google of what they perceive as inappropriate use?

That is interesting, and probably explains why I had to add a User-Agent
string to my code.

One worry recently occurred to me. I'm planning to do a long-term data
analysis project for grad school, where I'm basically sucking data off
of popular web 2.0 sites like Digg and then doing data mining on them
to learn about interesting trends.

I wonder, how much risk is there that someone at these sites will
notice some non-standard usage, and then decide to block me?
 
Gordon Beaton

Russell said:
One worry recently occurred to me. I'm planning to do a long-term
data analysis project for grad school, where I'm basically sucking
data off of popular web 2.0 sites like Digg and then doing data
mining on them to learn about interesting trends.

I wonder, how much risk is there that someone at these sites will
notice some non-standard usage, and then decide to block me?

If your tool behaves like a standard webcrawler, there should be no
issues. However, that means respecting things like robots.txt and other
mechanisms that webcrawlers are expected to obey.
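
As a first step, look at the site's robots.txt before fetching anything
else. A quick sketch (untested; it only prints the file, a real crawler
would parse the User-agent/Disallow records and honour them for every
URL it requests):

----

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public void printRobotsTxt(String host) throws Exception {
    URL u = new URL("http://" + host + "/robots.txt");
    HttpURLConnection conn = (HttpURLConnection) u.openConnection();
    // Identify the crawler; the name here is just an example.
    conn.setRequestProperty("User-Agent", "MyCrawler/0.1");
    BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
    in.close();
    conn.disconnect();
}

----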

Some information here:
http://www.robotstxt.org/wc/robots.html
http://www.robotstxt.org/wc/guidelines.html
http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
http://en.wikipedia.org/wiki/Spider_trap
http://en.wikipedia.org/wiki/Web_Crawler

/gordon

--
 
Ian Wilson

Russell said:
One worry recently occurred to me. I'm planning to do a long-term data
analysis project for grad school, where I'm basically sucking data off
of popular web 2.0 sites like Digg and then doing data mining on them
to learn about interesting trends.

I wonder, how much risk is there that someone at these sites will
notice some non-standard usage, and then decide to block me?

I think the polite thing to do would be to tell them what you plan to do
and ask them if they have any objections.

I guess you already read their terms and conditions of use? It sounds to
me like you should be using their RSS feeds rather than "sucking" HTML
pages.

"8 with the exception of accessing RSS feeds, you will not use any
robot, spider, scraper or other automated means to access the Site for
any purpose without our express written permission. Additionally, you
agree that you will not: (i) take any action that imposes, or may impose
in our sole discretion an unreasonable or disproportionately large load
on our infrastructure; (ii) interfere or attempt to interfere with the
proper working of the Site or any activities conducted on the Site; or
(iii) bypass any measures we may use to prevent or restrict access to
the Site;"
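
If you do go the RSS route, the XML parser built into the JDK is enough
for a first pass. A rough sketch (untested; pass in whatever feed URL the
site actually advertises):

----

import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public void printFeedTitles(String feedUrl) throws Exception {
    DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new URL(feedUrl).openStream());
    // RSS entries are <item> elements; print the title of each one.
    NodeList items = doc.getElementsByTagName("item");
    for (int i = 0; i < items.getLength(); i++) {
        Element item = (Element) items.item(i);
        System.out.println(
                item.getElementsByTagName("title").item(0).getTextContent());
    }
}

----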

Have you briefed your School's legal team yet :)
 
