faking HTTP with a socket channel

Roedy Green

If you fake HTTP by opening a socket then sending an HTTP header, does
the HTTP stuff go out in the first packet to the server, or is there
some sort of handshake first to establish the channel?

When you get your HttpConnection object, just what has happened at
that point?
 
Andreas Leitgeb

Roedy Green said:
> If you fake HTTP by opening a socket then sending an HTTP header, does
> the HTTP stuff go out in the first packet to the server, or is there
> some sort of handshake first to establish the channel?

There's of course a lot going on at the lower levels (TCP/IP,
Ethernet/PPP/...).

But taking the establishment of the TCP/IP connection for granted,
the HTTP request headers sent by the client are the first
data sent over that stream.

PS: "fake" is the wrong term for that. If you open a socket and
talk the HTTP protocol over it, then you *do* HTTP, rather than
fake it.

> When you get your HttpConnection object, just what has happened at
> that point?

I'm not exactly sure, but this class may hold a socket that connects
to a proxy rather than directly to the HTTP server. I don't know whether
going through a proxy changes what I said above (about the HTTP request
being the first data sent). For example, the proxy might require an
authentication step first, although that too may be done through
some extra HTTP headers.
 
Tom Anderson

> If you fake HTTP by opening a socket then sending an HTTP header, does
> the HTTP stuff go out in the first packet to the server, or is there
> some sort of handshake first to establish the channel?

There's a handshake at the TCP level - it's called the "three-way
handshake", because it involves the client sending a connection request
packet ('SYN'), the server responding with an acceptance ('SYN+ACK'), and
the client then sending a confirmation of the acceptance ('ACK'). The ACK
packet can carry data; the other two can't. Thus, there is one round-trip
between client and server before you can start sending actual data.

There was a TCP variant, T/TCP, which let you send data in the first and
second packets too, but it never caught on, in part because there were
security concerns.

If you treat TCP as just an abstract pipe for bytes, then there's no other
handshaking before the HTTP command. The first data bytes to travel over
the TCP connection will be the 0x47, 0x45, 0x54, 0x20 at the start of the
command (assuming it's a GET, that is).
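A minimal sketch of that point, assuming a throwaway listener on the loopback interface stands in for the web server (the class and method names are made up for illustration). The client opens a plain Socket, writes a GET request, and the listening side reads back the first four bytes that arrived over the TCP stream:

```java
import java.io.DataInputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class FirstBytesDemo {
    /** Sends a GET over a loopback TCP connection; returns the first four bytes the server side receives. */
    static byte[] firstFourBytes() throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             // The Socket constructor returns once the TCP handshake is done;
             // no application data has been written at that point.
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            OutputStream out = client.getOutputStream();
            out.write("GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            byte[] first = new byte[4];
            new DataInputStream(accepted.getInputStream()).readFully(first);
            return first;
        }
    }

    public static void main(String[] args) throws Exception {
        for (byte b : firstFourBytes()) {
            System.out.printf("0x%02X ", b);   // prints 0x47 0x45 0x54 0x20
        }
        System.out.println();
    }
}
```

This only shows what the application sees on the byte stream. The SYN, SYN+ACK and ACK packets are handled by the operating system and are visible only with a packet sniffer.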

> When you get your HttpConnection object, just what has happened at that
> point?

Depends how you get it. If you construct it directly, nothing - the socket
isn't opened until you call connect().

tom
 
Mark Space

Roedy said:
> If you fake HTTP by opening a socket then sending an HTTP header, does
> the HTTP stuff go out in the first packet to the server, or is there
> some sort of handshake first to establish the channel?
>
> When you get your HttpConnection object, just what has happened at
> that point?

The easiest way to figure out what is going on would be to get some
packet sniffer software. You can't really debug networking problems
unless you can be sure of what's going over the wire.

That said, I *think* the HTTP "handshake" consists of the client loading
up a lot of headers that say "I like this sort of format and I can talk
French too, please send me back something good." It's not so much a
handshake as a hope and a prayer that the server is able to comply. In
a pinch, everything just defaults down to LATIN-1 and plain text.

You can open a plain TCP socket to port 80, send "GET /" followed by a
newline, and most servers should be able to respond right there. That's
about as simple (and fake) as I can think of.
 
Roedy Green

> Depends how you get it. If you construct it directly, nothing - the socket
> isn't opened until you call connect().

When you hit connect what has been transferred?
 
Roedy Green

> The easiest way to figure out what is going on would be to get some
> packet sniffer software. You can't really debug networking problems
> unless you can be sure of what's going over the wire.

I have done a fair bit of wiresharking, so I am pretty clear on what
goes back and forth for HTTP. The mysteries were what happens at the
socket/packet level to get it fired up, and which Java methods do what.
 
Neil Coffey

Roedy said:
> When you hit connect what has been transferred?

Roedy -- at the HTTP level, nothing is transferred at that point.
A TCP connection is opened (so for example, something sitting in
ServerSocket.accept() at the other end will get "woken up" with the
incoming connection) but at the application level, no tangible bytes
are sent at that stage.

Then when you call getInputStream() on the HttpURLConnection, at
*that* point, the actual HTTP request and associated headers are
sent over the connection.
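One way to check this without a packet sniffer is sketched below. It assumes a throwaway server on 127.0.0.1 that waits 300 ms after accepting the connection to see whether any request bytes show up before the client has called getResponseCode(). The class name, the 300 ms probe window and the canned response are all made up for illustration, and the result may differ between JDK implementations.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class ConnectProbe {
    /** True if a request byte reached the server before the client called getResponseCode(). */
    static final AtomicBoolean bytesBeforeRequest = new AtomicBoolean();

    static int run() throws Exception {
        CountDownLatch probed = new CountDownLatch(1);
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            Thread serverThread = new Thread(() -> serve(server, probed));
            serverThread.start();
            URL url = URI.create("http://127.0.0.1:" + server.getLocalPort() + "/").toURL();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            conn.connect();                     // TCP connection is established here
            probed.await();                     // give the server time to look for early bytes
            int code = conn.getResponseCode();  // blocks until the status line has arrived
            conn.disconnect();
            serverThread.join();
            return code;
        }
    }

    static void serve(ServerSocket server, CountDownLatch probed) {
        try (Socket s = server.accept()) {
            InputStream in = s.getInputStream();
            s.setSoTimeout(300);                // probe window, chosen arbitrarily
            try {
                bytesBeforeRequest.set(in.read() >= 0);
            } catch (SocketTimeoutException nothingArrived) {
                bytesBeforeRequest.set(false);
            }
            probed.countDown();
            s.setSoTimeout(5000);
            // Consume the request up to the blank line (\r\n\r\n) that ends the headers.
            int state = 0, b;
            while (state < 4 && (b = in.read()) >= 0) {
                boolean expected = b == (state % 2 == 0 ? '\r' : '\n');
                state = expected ? state + 1 : (b == '\r' ? 1 : 0);
            }
            OutputStream out = s.getOutputStream();
            out.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
        } catch (IOException e) {
            probed.countDown();                 // never leave the client waiting
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("response code: " + run());
        System.out.println("request bytes seen right after connect(): " + bytesBeforeRequest.get());
    }
}
```

If the description above holds for the JDK in use, the second line should report false: the server sees the connection arrive but no request bytes until getResponseCode() (or getInputStream()) is called.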

However, I think this is just an implementation detail that we're
not supposed to rely on. Logically speaking, since no request or
headers have been sent, we would expect to be able to set headers
after calling connect(). But according to the spec, it's an error
to do so...

Neil
 
Roedy Green

> Roedy -- at the HTTP level, nothing is transferred at that point.
> A TCP connection is opened (so for example, something sitting in
> ServerSocket.accept() at the other end will get "woken up" with the
> incoming connection) but at the application level, no tangible bytes
> are sent at that stage.

Studying some code I wrote, I think connect sends the header, because
you can see the Response code right away, even before you call
getInputStream.

As I sort this out and compile the information you have all provided,
I am posting it at http://mindprod.com/jgloss/http.html#UNDERTHEHOOD

Here is my latest understanding:

What happens when your Java-based browser requests a page?

1. Nothing at all happens until it does an HTTPConnection.connect().

2. This triggers opening a TCP/IP socket connection to the server.
This is done by sending a SYN connection request packet. The server
sends back an SYN+ACK. Then the client sends an ACK, upon which may be
piggybacked some data.

3. This triggers sending a GET header composed of all the header
fields set up before the .connect. The GET request sent includes a
list of the encodings and compression algorithms we like.

4. Then the server sends back the requested page.

5. It calls HTTPConnection.getResponseCode to see if request went ok.

6. Then it calls HTTPConnection.getInputStream and reads the text of
the message from the server containing the requested web page.

7. It then scans the web page for the urls of embedded images and puts
out GET requests for them.

8. Then various images come back from the server on the original
socket or on their own sockets so they can arrive simultaneously.

The stream is made purely of printable characters. The server can
detect the start of a new GET request by looking for line terminators.


I don't know whether connect blocks until a response arrives, or
whether getResponseCode does.

I could find out by dumping some nanotimes in the process.
 
Neil Coffey

Roedy said:
> 2. This triggers opening a TCP/IP socket connection to the server.
> This is done by sending a SYN connection request packet. The server
> sends back an SYN+ACK. Then the client sends an ACK, upon which may be
> piggybacked some data.

From a theoretical point of view, I think you're correct. But in
the examples I've seen, as I say, Java doesn't actually seem to
send any data at this stage. But in terms of using the API, you're
supposed to assume that it has sent the headers, as far as I can see.

> 3. This triggers sending a GET header composed of all the header
> fields set up before the .connect. The GET request sent includes a
> list of the encodings and compression algorithms we like.
>
> 4. Then the server sends back the requested page.
>
> 5. It calls HTTPConnection.getResponseCode to see if request went ok.
>
> 6. Then it calls HTTPConnection.getInputStream and reads the text of
> the message from the server containing the requested web page.

This sounds a bit muddled. The *first* line of the response from the
server includes the response code. Under the hood, I think this
is probably read when you call getInputStream(), but I haven't
investigated. Then come any response headers -- again, probably read
in automagically when you call getInputStream().

Then when you read from the stream returned by getInputStream(), you're
reading directly from the underlying socket as far as I'm aware. I'm
not quite sure why you separate out "the server sends back the
requested page" and the client "read[ing] the text of the message...":
at the application level, these are one and the same process.

> 8. Then various images come back from the server on the original
> socket or on their own sockets so they can arrive simultaneously.

OK, although you make it sound as though one request per connection is
the desirable case; from the server's point of view, it definitely
isn't. As of HTTP 1.1, the default behaviour is to perform multiple
requests per socket if possible. (It isn't when the server can't
calculate the size of the returned data in advance, e.g. with many
dynamic web pages.)

> The stream is made purely of printable characters. The server can
> detect the start of a new GET request by looking for line terminators.

I think this is true of the request stream sent TO the server
specifically for the case of GET commands. But the data sent back
FROM the server (after the headers) can be any sequence of bytes. And
you can POST any sequence of bytes to the server.

Neil
 
Tom Anderson

> OK, although you make it sound as though one request per connection is
> the desirable case; from the server's point of view, it definitely
> isn't. As of HTTP 1.1, the default behaviour is to perform multiple
> requests per socket if possible. (It isn't when the server can't
> calculate the size of the returned data in advance, e.g. with many
> dynamic web pages.)

To expand on that, if the server has an output buffer, and the generated
page is smaller than the buffer, then it can collect the whole thing in
the buffer, work out the size, send a Content-Length header, and reuse the
connection afterwards. If it's bigger than the buffer, it can't do this -
i think that's what you were getting at with "many dynamic web pages".

What you might not be aware of is the chunked encoding:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1
http://developers.sun.com/mobility/midp/questions/chunking/

Which allows the transfer of unknown-length data across an HTTP connection
without having to close the stream afterwards. Thus, even for dynamic
pages bigger than the buffer, the connection can be reused.
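To make the framing concrete, here is a minimal sketch of a decoder for a chunked body that is already fully in memory. The class name is made up, and it assumes well-formed input: each chunk is a hexadecimal size line, CRLF, that many bytes of data, and another CRLF, with a zero-size chunk marking the end. Chunk extensions are dropped and any trailer headers after the last chunk are ignored.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class ChunkedDecoder {
    /** Decodes a complete chunked body; stops at the zero-size chunk. */
    static byte[] decode(byte[] raw) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        int pos = 0;
        while (true) {
            int eol = indexOfCrlf(raw, pos);
            if (eol < 0) {
                throw new IllegalArgumentException("missing chunk-size line");
            }
            String sizeLine = new String(raw, pos, eol - pos, StandardCharsets.US_ASCII);
            int semi = sizeLine.indexOf(';');            // drop optional chunk extensions
            if (semi >= 0) {
                sizeLine = sizeLine.substring(0, semi);
            }
            int size = Integer.parseInt(sizeLine.trim(), 16);
            pos = eol + 2;                               // move past the size line's CRLF
            if (size == 0) {
                return body.toByteArray();               // last chunk: end of the body
            }
            if (pos + size + 2 > raw.length) {
                throw new IllegalArgumentException("truncated chunk");
            }
            body.write(raw, pos, size);
            pos += size + 2;                             // skip the data and its trailing CRLF
        }
    }

    private static int indexOfCrlf(byte[] a, int from) {
        for (int i = from; i + 1 < a.length; i++) {
            if (a[i] == '\r' && a[i + 1] == '\n') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] wire = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(decode(wire), StandardCharsets.US_ASCII)); // prints Wikipedia
    }
}
```

Because the zero-size chunk tells the receiver where the body ends, neither side needs a Content-Length header or a closed socket to find the end of the response, which is what lets the connection be reused.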

tom
 
Roedy Green

> When you get your HttpConnection object, just what has happened at
> that point?

I did an experiment. Here is the output:

a: 4160
b: 4362600 <- .openConnection
c: 142284360 <- .connect
200
d: 229896880 <- .getResponseCode
e: 2611160 <- .getInputStream

So it looks like .connect does the work -- establishing the socket and
sending the header.

It looks like getResponseCode blocks until the response arrives from
the server.

elapsed( "a:" );

URLConnection urlc = source.openConnection();
elapsed( "b:" );

if ( urlc == null )
    {
    throw new IOException( "Unable to make a connection." );
    }
else
    {
    if ( DEBUGGING )
        {
        // get simple name of class without the package
        String s = urlc.getClass().getName();
        // manually chop off all past the last dot.
        int lastDot = s.lastIndexOf( '.' );
        if ( lastDot >= 0 )
            {
            s = s.substring( lastDot + 1 );
            }
        System.out.println( " Connecting to " +
                            source.toString() + " with " + s );
        }
    }
urlc.setAllowUserInteraction( false );
urlc.setDoInput( true );
urlc.setDoOutput( false );
urlc.setUseCaches( false );
urlc.setReadTimeout( readTimeout );
urlc.setConnectTimeout( connectTimeout );
// string literals may not span lines, so the value is built by concatenation
urlc.setRequestProperty( "Accept",
                         "text/html, image/png, image/jpeg, image/gif, "
                         + "application/x-java-serialized-object, text/x-java-source, "
                         + "text/xml, application/xml, "
                         + "text/css, application/x-java-jnlp-file, text/plain, "
                         + "application/zip, application/octet-stream, "
                         + "*; q=.2, */*; q=.2" );
urlc.connect();
elapsed( "c:" );

System.out.println( ( (HttpURLConnection) urlc ).getResponseCode() );
elapsed( "d:" );

long length = urlc.getContentLength(); // -1 if not available

// O P E N _ S O U R C E, raw byte stream
InputStream is = urlc.getInputStream();
elapsed( "e:" );
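The elapsed() helper is not shown in the post. A minimal sketch of what it might look like follows, assuming it prints the nanoseconds since the previous call (which would fit the output above, where e: is smaller than d:). The class name and the return value are additions for illustration only.

```java
class Elapsed {
    // Timestamp of the previous checkpoint; starts when the class is loaded.
    private static long last = System.nanoTime();

    /** Prints the label and the nanoseconds since the previous call, and returns that delta. */
    static long elapsed(String label) {
        long now = System.nanoTime();
        long delta = now - last;
        last = now;
        System.out.println(label + " " + delta);
        return delta;
    }

    public static void main(String[] args) throws InterruptedException {
        elapsed("a:");
        Thread.sleep(10);   // stands in for a blocking call such as connect()
        elapsed("b:");      // reports roughly ten million nanoseconds or more
    }
}
```

System.nanoTime() is meant for measuring intervals like this. A single run mixes class loading, DNS lookup and network latency, so the figures are only a rough guide to which call blocks.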
 
EJP

Roedy said:
> It looks like getResponseCode blocks until the response arrives from
> the server.

It would be rather astonishing if it didn't, wouldn't it? The response
contains the response code, so of course it has to block.
 
Daniele Futtorovic

> I did an experiment. Here is the output:
>
> a: 4160
> b: 4362600 <- .openConnection
> c: 142284360 <- .connect
> 200
> d: 229896880 <- .getResponseCode
> e: 2611160 <- .getInputStream
>
> So it looks like .connect does the work -- establishing the socket and
> sending the header.
>
> It looks like getResponseCode blocks until the response arrives from
> the server.

Have a look at the source code for HttpURLConnection. There's not much
there, and the specific implementation may override what there is.
Nevertheless, that basic implementation calls getInputStream() in
getResponse{Code|Message}(). getInputStream() probably sets up all the
stuff (response code, response headers), and the stream gets stored
(cached?) on the first call. That last point would be interesting, by
the way. Does the data get downloaded wholly the first time, or is the
InputStream left open after the first \r\n\r\n? Presumably the latter.
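As an illustration of the "presumably the latter" case, here is a sketch of a reader that parses only the status line and headers and then stops, leaving the stream positioned at the first byte of the body. This is not the JDK's code. The class is made up for this example and assumes a well-formed response.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class ResponseHead {
    final int statusCode;
    final Map<String, String> headers = new LinkedHashMap<>();

    /** Reads the status line and headers only; the body stays unread in the stream. */
    ResponseHead(InputStream in) throws IOException {
        String statusLine = readLine(in);                 // e.g. "HTTP/1.1 200 OK"
        statusCode = Integer.parseInt(statusLine.split(" ", 3)[1]);
        String line;
        while (!(line = readLine(in)).isEmpty()) {        // blank line = end of headers
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
    }

    // Reads one byte at a time so nothing past the line terminator is consumed.
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) >= 0 && b != '\n') {
            if (b != '\r') {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok".getBytes(StandardCharsets.US_ASCII));
        ResponseHead head = new ResponseHead(in);
        System.out.println(head.statusCode + " " + head.headers);   // prints 200 {Content-Length=2}
        System.out.println((char) in.read());                        // prints o, the first body byte
    }
}
```

With this shape, the response code is available as soon as the headers have been read, and the caller pulls the body from the same stream afterwards at its own pace.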
 
Roedy Green

> It would be rather astonishing if it didn't, wouldn't it? The response
> contains the response code, so of course it has to block.

Granted, it works the way any sane person would design it, but the
way the universe works, I could equally well have expected it to
return -1 if the response had not arrived yet. You would then have to,
say, open the InputStream first to do the blocking. If you read back
in this thread, others suspected it did indeed work this way.
 
