infinite loop with http requests

yawnmoth

I'm trying to write something that'll let me output the contents of a
given webpage while skipping over the headers. Since I'm trying to
learn raw HTTP, I'm using Sockets and not URL.

Anyway, the header of an HTTP response ends with "\r\n\r\n".
BufferedReader's readLine treats that as two lines, since it considers
"\r\n" to be a line terminator. Since it also strips off the line
terminator, readLine should return the second line as "".

Per that, I've written a program that will loop, continuously, until ""
is encountered. Unfortunately, "" never appears to be encountered and
thus I have an infinite loop.

Here's my code:

import java.net.*;
import java.io.*;

public class HttpRequestor
{
    public static void main(String[] args) {
        try {
            Socket sock = new Socket("www.google.com", 80);
            String httpRequest = "GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n";
            sock.getOutputStream().write(httpRequest.getBytes());
            BufferedReader text = new BufferedReader(new InputStreamReader(sock.getInputStream()));

            String line, output = "";
            while (text.readLine() != "");
            while ((line = text.readLine()) != null) {
                System.out.println("\r\n'"+URLEncoder.encode(line)+"'\r\n");
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

To confirm that I was indeed getting "" back from readLine, I wrote the
following:

import java.net.*;
import java.io.*;

public class HttpRequestor
{
    public static void main(String[] args) {
        try {
            Socket sock = new Socket("www.google.com", 80);
            String httpRequest = "GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n";
            sock.getOutputStream().write(httpRequest.getBytes());
            BufferedReader text = new BufferedReader(new InputStreamReader(sock.getInputStream()));

            String line, output = "";
            while ((line = text.readLine()) != null) {
                System.out.println("\r\n'"+URLEncoder.encode(line)+"'\r\n");
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This shows that "" is indeed being returned by readLine. So why
doesn't the while loop in the first program terminate when "" is
received?

Any insights would be appreciated - thanks!
 
Robert Klemme

yawnmoth said:
This shows that "" is indeed being returned by readLine. So why
doesn't the while loop in the first program terminate when "" is
received?

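Because you compare strings with == (identity) instead of with equals()
(equivalence). The string returned by readLine() is a different object
from the literal "", so the != test stays true even when the line is
empty.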
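
To make the difference concrete, here is a minimal sketch. The class name StringCompareDemo and the helper firstLine are made up for illustration and are not part of the original program. It reads one blank line through a BufferedReader wrapped around a StringReader, so no network is needed, and compares the result both ways.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class StringCompareDemo {
    // Hypothetical helper: returns the first line readLine() produces for the input.
    static String firstLine(String input) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(input));
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        // "\r\n" on its own is one empty line, so readLine() returns a string of length 0.
        String line = firstLine("\r\n");

        // == asks whether both sides are the same object. The string built by
        // readLine() is normally a different object from the literal "".
        System.out.println("identity:    " + (line == ""));

        // equals() asks whether both sides hold the same characters.
        System.out.println("equivalence: " + line.equals(""));
    }
}
```

The equals() comparison is the one that reflects the content of the line. The identity comparison depends on which String object the JVM happens to hand back, which is why it should not be used to detect the empty line.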

robert
 
Oliver Wong

yawnmoth said:
I'm trying to write something that'll let me output the contents of a
given webpage while skipping over the headers. Since I'm trying to
learn raw HTTP, I'm using Sockets and not URL.

[snip most of the code]
Socket sock = new Socket("www.google.com", 80);

I recommend against using google as your test server. Google does some
funky stuff when it detects that Java is connecting to it, which may give
you unexpected results.

- Oliver
 
Daniel Pitts

Oliver said:
I recommend against using google as your test server. Google does some
funky stuff when it detects that Java is connecting to it, which may give
you unexpected results.

- Oliver

Good suggestion, except for two things. First, he isn't using Java's URL
API, which is what's responsible for setting the User-Agent string.
Second, you can override the User-Agent string, and Google couldn't
possibly know the difference.
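
With a raw socket the request contains only the headers you write yourself, so the User-Agent is whatever string you choose, or absent altogether. A small sketch of that idea follows. The class RequestBuilder, the helper buildRequest, and the agent string "ExampleClient/0.1" are all assumptions made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class RequestBuilder {
    // Hypothetical helper: builds an HTTP/1.0 GET request with an explicit
    // User-Agent header. Each header line ends in \r\n and an empty line
    // terminates the header block.
    static String buildRequest(String host, String path, String userAgent) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "User-Agent: " + userAgent + "\r\n"
                + "\r\n";
    }

    public static void main(String[] args) {
        // The agent string below is invented for this example.
        String request = buildRequest("localhost", "/", "ExampleClient/0.1");
        byte[] bytes = request.getBytes(StandardCharsets.US_ASCII);
        System.out.print(request);
        System.out.println(bytes.length + " bytes would be written to the socket");
    }
}
```

The resulting bytes could be passed to sock.getOutputStream().write(...) in place of the hard-coded request in the original program.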

In any case, the OP's problem is that he is comparing with line == "",
when he should use line.equals(""), or better yet line.length() == 0
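
A hedged sketch of the corrected header-skipping loop follows. The names HeaderSkipper and readBody are made up for illustration, and the response is simulated with a StringReader rather than fetched from a server. The loop also checks for null, since readLine() returns null at end of stream and calling length() on it would throw.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class HeaderSkipper {
    // Hypothetical helper: skips the header block of an HTTP response and
    // returns the rest, with each body line terminated by '\n'.
    static String readBody(Reader response) throws IOException {
        BufferedReader text = new BufferedReader(response);
        String line;

        // Consume header lines up to and including the first empty line.
        // The null check stops the loop if the stream ends before that.
        while ((line = text.readLine()) != null && line.length() != 0) {
            // header line, intentionally ignored
        }

        StringBuilder body = new StringBuilder();
        while ((line = text.readLine()) != null) {
            body.append(line).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulated response; a real one would come from sock.getInputStream().
        String response = "HTTP/1.0 200 OK\r\n"
                + "Content-Type: text/html\r\n"
                + "\r\n"
                + "<html>\r\n"
                + "hello\r\n"
                + "</html>\r\n";
        System.out.print(readBody(new StringReader(response)));
    }
}
```

Unlike the original `while (text.readLine() != "");`, this version keeps each line in a variable, so the same value is tested for both end of stream and emptiness.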

HTH,
Daniel.
 
Chris Uppal

Daniel said:
Oliver said:
I recommend against using google as your test server. Google does
some funky stuff when it detects that Java is connecting to it, which
may give you unexpected results.
[...]
Good suggestion, except for two things. First, he isn't using Java's URL
API, which is what's responsible for setting the User-Agent string.
Second, you can override the User-Agent string, and Google couldn't
possibly know the difference.

I agree with Oliver's advice. Google is perfectly at liberty to treat requests
differently depending on how they /appear/ to have been submitted.

If I were them I would group requests into at least three categories: ones that
appear to be legit (as far as we can tell from the various meta-info in a
request); those that appear to come from frequently abused clients (such as the
Java stuff); and those where we can't tell much. I would be less aggressive
about -- say -- shutting off an over-eager client IP address if the requests
appeared to be from a normal browser than if they appeared to come from
uncontrolled code. And I'd put the "can't tell" ones somewhere in the middle.

But the bottom line is not that Google /can/ treat requests differently
depending on apparently immaterial meta stuff, but that it /does/ do so --
which makes it a very poor example domain for a beginner (to HTTP) to test
against.

-- chris
 
Daniel Pitts

Chris said:
But the bottom line is not that Google /can/ treat requests differently
depending on apparently immaterial meta stuff, but that it /does/ do so --
which makes it a very poor example domain for a beginner (to HTTP) to test
against.

-- chris

Okay, while my point was that you can "trick" Google into thinking that
the request probably comes from a legit client, your point is well taken.

I suppose a good way to learn HTTP is to set up a web server in your own
development environment (such as Apache or Resin) and use it instead of
a third-party website. That way you also have control over the content
being produced.
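
For a quick experiment without installing Apache or Resin, one option is the small HTTP server that ships with the JDK (com.sun.net.httpserver.HttpServer). That is a substitution for the servers named above, not the same setup. The sketch below assumes that server is available in your JDK; the class LocalTestServer and the helpers start and fetchBody are made up for illustration. It serves fixed text on a free loopback port and fetches it with a raw socket.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LocalTestServer {
    // Hypothetical helper: starts a throwaway server on a free loopback port
    // that answers every request with the given (non-empty) text.
    static HttpServer start(String content) throws IOException {
        InetSocketAddress address =
                new InetSocketAddress(InetAddress.getLoopbackAddress(), 0);
        HttpServer server = HttpServer.create(address, 0);
        server.createContext("/", exchange -> {
            byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        return server;
    }

    // Hypothetical helper: fetches "/" over a raw socket and returns the body
    // without the headers, each line terminated by '\n'.
    static String fetchBody(int port) throws IOException {
        try (Socket sock = new Socket(InetAddress.getLoopbackAddress(), port)) {
            String request = "GET / HTTP/1.0\r\nHost: localhost\r\n\r\n";
            OutputStream out = sock.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader text = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), StandardCharsets.UTF_8));
            String line;
            while ((line = text.readLine()) != null && !line.isEmpty()) {
                // skip header lines
            }
            StringBuilder body = new StringBuilder();
            while ((line = text.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start("hello from localhost");
        try {
            System.out.print(fetchBody(server.getAddress().getPort()));
        } finally {
            server.stop(0); // release the port and let the JVM exit
        }
    }
}
```

Because the server runs in the same process, the content and headers it returns are fully under your control, which is the point of testing against your own machine.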

- Daniel.
 
