Trying to GET google with socket....problem

Hey You

Well I don't know why the socket can't connect to Google. Here is my
source code:

require 'socket'
h = TCPSocket.new('www.google.ca',80)
h.print "GET /index.html HTTP/1.0\n\n"
a = h.read
puts a

I tried changing the HTTP to 1.1 but it still doesn't work.
 
Michael Gorsuch

I just ran this code in irb, and it worked without issue.

Can you provide the specific exception or unexpected results?
 
Michael Gorsuch

Also, can you provide the platform that you are using? I was using OS X.
 
Ryan Davis

Well I don't know why the socket can't connect to Google. Here is my
source code:

require 'socket'
h = TCPSocket.new('www.google.ca',80)
h.print "GET /index.html HTTP/1.0\n\n"
a = h.read
puts a

If you just want to get google (or whatever), use:

ruby -ropen-uri -e 'puts URI.parse("http://www.google.com/index.html").read'

If you want to know the inner workings of HTTP clients and servers,
use the above and trace it backwards. There is a lot of good code in
there.
 
Hey You

Michael said:
I just ran this code in irb, and it worked without issue.

Can you provide the specific exception or unexpected results?

Well I just ran the code and got this:

HTTP/1.0 302 Found
Location: http://www.google.ca/index.html
Cache-Control: private
Set-Cookie: PREF=ID=e20f9edec5958042:TM=1175979001:LM=1175979001:S=shwmC1m6Amdg20nV; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com
Content-Type: text/html
Server: GWS/2.1
Content-Length: 228
Date: Sat, 07 Apr 2007 20:50:01 GMT
Connection: Keep-Alive

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="http://www.google.ca/index.html">here</A>.
</BODY></HTML>

Also I would like to stick to using sockets instead of other HTTP
clients :).
 
Hey You

Michael said:
Also, can you provide the platform that you are using? I was using OS X.

Well I don't know what you meant right there but I'm using Windows XP.
 
Michael Gorsuch

OK, so you are getting a response back from the server.

I have no idea why you're getting a redirect from them, but you are getting a proper response over your socket.
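
For what it's worth, the 302 in the output posted above can be picked apart with a few lines of Ruby. This is only a sketch: parse_response is a made-up helper name (it is not part of Ruby's socket library), and the sample string is a shortened copy of that response.

```ruby
# Hypothetical helper: split a raw HTTP response into status code,
# headers, and body. Header names are lowercased for easy lookup.
def parse_response(raw)
  head, body = raw.split(/\r?\n\r?\n/, 2)   # blank line separates headers from body
  lines = head.split(/\r?\n/)
  status_line = lines.shift                 # e.g. "HTTP/1.0 302 Found"
  code = status_line.split(' ', 3)[1].to_i
  headers = {}
  lines.each do |line|
    name, value = line.split(':', 2)
    headers[name.strip.downcase] = value.to_s.strip
  end
  { code: code, headers: headers, body: body.to_s }
end

sample = "HTTP/1.0 302 Found\r\n" \
         "Location: http://www.google.ca/index.html\r\n" \
         "Content-Type: text/html\r\n" \
         "\r\n" \
         "<HTML>...</HTML>"
response = parse_response(sample)
puts response[:code]                  # 302
puts response[:headers]['location']   # http://www.google.ca/index.html
```

A status code in the 3xx range together with a Location header is the server's way of saying "ask again over there", so the socket itself is working as intended.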
 
Hey You

Michael said:
OK, so you are getting a response back from the server.

I have no idea why you're getting a redirect from them, but you are
getting a proper response over your socket.

Well thank you for the answer :). The thing is that it's weird that even
when I put the host as google.ca it still redirects me to google.ca.
Well thank you to everyone that has helped me and I appreciate it but I
am wondering something else now: Why when I put HTTP/1.1 the program
loads but it just stays blank, not doing anything.
 
Philipp Taprogge

Hi!

The answers to both of your questions are simple... :)

Thus spake Hey You on 04/07/2007 11:51 PM:
Well thank you for the answer :). The thing is that it's weird that even
when I put the host as google.ca it still redirects me to google.ca.

That's because google redirects you to your localized version of
google and you did not specify the hostname in your get. You open a
socket to www.google.ca, but you only tell it to deliver some
"index.html". If that machine hosted multiple domains (which in fact
it does), it would not know whether to send you
www.google.ca/index.html or perhaps www.google.de/index.html.
So it informs you that it has an "/index.html" for you which it
figures might best suit your needs and that this page can be found
by issuing the following HTTP command:

GET www.google.ca/index.html HTTP/1.0\n\n

Well thank you to everyone that has helped me and I appreciate it but I
am wondering something else now: Why when I put HTTP/1.1 the program
loads but it just stays blank, not doing anything.

The answer to that question is even simpler:
In HTTP/1.0, you open a socket, issue a request, get a response and
close the socket again for each and every single item you need. You
open a socket for the html-page itself, another one to request an
image specified in that page and so on. So after each request, the
socket is closed by the server.

When you specify HTTP/1.1, you have another option: persistent
connections. When you request a resource via HTTP/1.1, a compliant
server MAY keep the socket open for you after its response so that you
can send another request without having to open a whole new socket. If
the server does this, it is the client's responsibility to close the
socket when it does not require any more data. That is why your program
appears to hang: h.read waits for the end of the stream, which never
arrives while the server keeps the socket open.

Try it: open up a telnet connection to www.google.ca on port 80 and
issue your request as HTTP/1.0. The socket will close immediately after
the response from the server.
Now do the same thing again but specify HTTP/1.1. This time the
socket stays open and you can issue another request (or the same
request again, to keep things simple).

For further information I suggest you read RFC 1945 (HTTP/1.0) and
RFC 2616 (HTTP/1.1) respectively.
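
One way around the hang, sketched here under some assumptions, is to stop relying on end-of-file and read only as many body bytes as the Content-Length header announces. read_one_response is a hypothetical helper name, the StringIO stands in for a socket that the server keeps open, and the sketch assumes the response has a Content-Length header (it does not handle chunked encoding).

```ruby
require 'stringio'

# Hypothetical helper: read exactly one HTTP response from io without
# waiting for end-of-file. Assumes a Content-Length header is present.
def read_one_response(io)
  status_line = io.gets.to_s.chomp
  headers = {}
  while (line = io.gets)
    line = line.chomp              # strips "\r\n" as well as a bare "\n"
    break if line.empty?           # blank line ends the header block
    name, value = line.split(':', 2)
    headers[name.strip.downcase] = value.to_s.strip
  end
  body = io.read(headers.fetch('content-length', '0').to_i)
  [status_line, headers, body]
end

# Stand-in for a kept-alive connection: two responses on one stream.
stream = StringIO.new(
  "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nfirst" \
  "HTTP/1.1 200 OK\r\nContent-Length: 6\r\n\r\nsecond"
)
2.times do
  status, _headers, body = read_one_response(stream)
  puts "#{status} -> #{body}"
end
```

A simpler alternative is to add a "Connection: close" header to the HTTP/1.1 request, which asks the server to close the socket after its response, so a plain read returns.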

HTH, HAND,

Phil
 
Hey You

Thank you a lot Phil! I have learned a lot from you like how to POST
data (Yup, I learned) and much more and I am very grateful for all the
help you have given me. It makes sense why it didn't connect to
google.ca and I learned how to fix it right after my last post but I had
to go offline. I have also read RFC2616 but only bits and pieces of what
I have read are stuck in my head so I will keep re-reading it to learn
more. I will also read RFC1945 and I'm sorry for my newbish posts. It's
not that I'm lazy, because I really am a hard worker; it's just that I
needed someone to point me in the right direction and that is what you
did :).
 
Brian Candler

Well I don't know why the socket can't connect to Google. Here is my
source code:

require 'socket'
h = TCPSocket.new('www.google.ca',80)
h.print "GET /index.html HTTP/1.0\n\n"
a = h.read
puts a

I tried changing the HTTP to 1.1 but it still doesn't work.

Two problems:
(1) Line terminator for HTTP is \r\n not \n
(2) You have not supplied a Host: header

h.print "GET /index.html HTTP/1.0\r\nHost: www.google.ca\r\n\r\n"
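
As a small sketch of the same fix, the request can be assembled from its parts so the CRLF terminators and the closing empty line are hard to forget. build_get_request is a made-up helper name, not something from the standard library.

```ruby
# Hypothetical helper: build a minimal HTTP/1.0 GET request with CRLF
# line endings, a Host header, and an empty line ending the headers.
def build_get_request(host, path = '/')
  [
    "GET #{path} HTTP/1.0",
    "Host: #{host}",
    '',   # joining the two empty strings below yields the final "\r\n\r\n"
    ''
  ].join("\r\n")
end

request = build_get_request('www.google.ca', '/index.html')
p request
# "GET /index.html HTTP/1.0\r\nHost: www.google.ca\r\n\r\n"
```

With the socket h from the original post, this would be sent with h.print build_get_request('www.google.ca', '/index.html').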

I say again: you must read and understand RFC 2616.

This documents HTTP/1.1, which has gained a lot of features. You could try
reading the earlier RFCs for HTTP/1.0 or HTTP/0.9 for a simplified protocol.

B.
 
Hey You

Brian said:
Two problems:
(1) Line terminator for HTTP is \r\n not \n
(2) You have not supplied a Host: header

h.print "GET /index.html HTTP/1.0\r\nHost: www.google.ca\r\n\r\n"

I say again: you must read and understand RFC 2616.

This documents HTTP/1.1, which has gained a lot of features. You could try
reading the earlier RFCs for HTTP/1.0 or HTTP/0.9 for a simplified protocol.

B.

Have you read what I last posted? Or did you just ignore it and give me
the answer to an already answered question? Yes, I have read RFC2616 more
than once and I do understand a lot of it, but not all of it stays in my
head after the few times I have read the document. I don't know, but I
have read in a lot of places that for a line terminator you can also use
"\n\n", and it seems to work fine. Also, putting the Host header or adding
the full domain to the code such as "GET www.google.ca/index.html" both
specify which host we want, so I don't see why I should change them.
 
Brian Candler

Have you read what I last posted? Or did you just ignore it and give me
the answer to an already answered question? Yes, I have read RFC2616 more
than once and I do understand a lot of it, but not all of it stays in my
head after the few times I have read the document. I don't know, but I
have read in a lot of places that for a line terminator you can also use
"\n\n", and it seems to work fine.

Read RFC 2616 section 2.2:

" HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
protocol elements except the entity-body (see appendix 19.3 for
tolerant applications)."

and appendix 19.3 says:

" The line terminator for message-header fields is the sequence CRLF.
However, we recommend that applications, when parsing such headers,
recognize a single LF as a line terminator and ignore the leading CR."

So the upshot is: you're sending a malformed request, but some servers may
honour it.
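
To illustrate what "tolerant" means here, the sketch below contrasts a parser that accepts both CRLF and a bare LF with one that insists on CRLF. Both method names are invented for this example; real servers differ in how lenient they are.

```ruby
# Tolerant: split on LF and drop a trailing CR if present, so both
# "\r\n" and "\n" are accepted as line terminators.
def tolerant_header_lines(block)
  block.split("\n").map { |line| line.chomp("\r") }
end

# Strict: only CRLF counts as a line terminator.
def strict_header_lines(block)
  block.split("\r\n")
end

lf_only = "GET /index.html HTTP/1.0\nHost: www.google.ca\n"
crlf    = "GET /index.html HTTP/1.0\r\nHost: www.google.ca\r\n"

p tolerant_header_lines(lf_only)   # two lines
p tolerant_header_lines(crlf)      # the same two lines
p strict_header_lines(lf_only)     # one unsplit blob
```

A strict server sees the LF-only request as a single unterminated line, which is why sending "\n" works only when the other end chooses to be lenient.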
Also, putting the Host header or adding the full
domain to the code such as "GET www.google.ca/index.html" both specify
which host we want, so I don't see why I should change them.

No, "GET www.google.ca/index.html" is a completely malformed request and
will be rejected. In any case this is different to the GET request you
actually sent, quoted at the very top of this posting.

The hostname is *never* supplied as part of the GET line.

Of course you supplied it to Ruby's TCPSocket.new method, but at that point
the hostname is converted to an IP address before the connection is opened.
The name is not passed to the far end and therefore you must provide a Host:
header.
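
This can be checked without touching Google at all. The sketch below, which assumes the machine resolves 'localhost' and allows loopback connections, starts a throwaway server on 127.0.0.1 and records the raw bytes it receives. capture_request is a hypothetical helper written for this demonstration.

```ruby
require 'socket'

# Hypothetical helper: start a throwaway loopback server, send it
# request_text through a client socket, and return the bytes it received.
def capture_request(request_text)
  server = TCPServer.new('127.0.0.1', 0)      # port 0: let the OS pick one
  port = server.addr[1]
  receiver = Thread.new do
    client = server.accept
    data = client.read                        # returns once the sender closes
    client.close
    data
  end
  sock = TCPSocket.new('localhost', port)     # the name is resolved here, locally
  sock.print request_text
  sock.close
  receiver.value
ensure
  server&.close
end

without_host = capture_request("GET /index.html HTTP/1.0\r\n\r\n")
with_host    = capture_request("GET /index.html HTTP/1.0\r\nHost: localhost\r\n\r\n")
puts without_host.include?('localhost')
puts with_host.include?('localhost')
```

If the loopback setup works as assumed, this prints false and then true: the name given to TCPSocket.new never reaches the server unless the request itself carries it.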

I'm sorry, but I'm dropping out of this conversation now. Your response was
arrogant. If you know nothing about HTTP, then I suggest you don't go around
telling people who know something about HTTP that they are wrong.

Regards,

Brian.
 
Xavier Noria

Have you read what I last posted? Or did you just ignore it and give me
the answer to an already answered question? Yes, I have read RFC2616 more
than once and I do understand a lot of it, but not all of it stays in my
head after the few times I have read the document. I don't know, but I
have read in a lot of places that for a line terminator you can also use
"\n\n", and it seems to work fine.

Perhaps you read that in a CGI context?

"The server MUST translate the header data from the CGI header field
syntax to the HTTP header field syntax if these differ. For example,
the character sequence for newline (such as Unix's ASCII NL) used by
CGI scripts may not be the same as that used by HTTP (ASCII CR
followed by LF)."

That's what allows CGIs to output things like

print "Content-Type: text/plain\n\n"

and forget about CRLFs.

-- fxn
 
Zephyr Pellerin

Brian said:
Two problems:
(1) Line terminator for HTTP is \r\n not \n
(2) You have not supplied a Host: header

h.print "GET /index.html HTTP/1.0\r\nHost: www.google.ca\r\n\r\n"

I say again: you must read and understand RFC 2616.

This documents HTTP/1.1, which has gained a lot of features. You could try
reading the earlier RFCs for HTTP/1.0 or HTTP/0.9 for a simplified protocol.

B.

That would be the issue.
 
Gary Wright

Also, putting the Host header or adding the full
domain to the code such as "GET www.google.ca/index.html" both specify
which host we want, so I don't see why I should change them.

The URI provided in the GET request can be an absolute URI only if
the request is going to a proxy server. In *that* case the absolute URI
names the site you ultimately want, and the GET would look like:

GET http://www.google.ca/index.html HTTP/1.0

Otherwise the URI must be an absolute path (i.e., a path starting
with '/').
In that case the GET would look like:

GET /index.html HTTP/1.0

The problem with only having the path is that a web server that is
hosting several websites can't determine from the GET request which site
the request pertains to. The incoming TCP connections only have a
destination IP address, not a destination domain name. The solution to
this problem is the "Host:" header. By looking at the "Host:" header, the
web server can multiplex several websites at the same IP address. Without
the Host: header you would have to have a separate IP address for every
website.

So your request should be sent as:

GET /index.html HTTP/1.0
Host: www.google.ca
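
As a rough sketch of that multiplexing, a server could look up the site by the Host header like this. The site table and the site_for helper are invented for illustration and say nothing about how Google's servers actually work.

```ruby
# Hypothetical helper: pick a site from the Host header of a raw request.
# Returns nil when the header is missing or names an unknown site.
def site_for(raw_request, sites)
  host_line = raw_request.lines.find { |line| line =~ /\AHost:/i }
  return nil unless host_line

  host = host_line.split(':', 2)[1].strip.downcase
  host = host.sub(/:\d+\z/, '')    # drop an optional ":port" suffix
  sites[host]
end

# Made-up table of sites sharing one IP address.
sites = {
  'www.google.ca' => 'Canadian home page',
  'www.google.de' => 'German home page'
}

p site_for("GET /index.html HTTP/1.0\r\nHost: www.google.ca\r\n\r\n", sites)
p site_for("GET /index.html HTTP/1.0\r\n\r\n", sites)
```

The first call finds the Canadian site; the second returns nil, which is the situation the server was in when it received the original request without a Host header.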
 
