Saunvit Pandya
I have a bit of a problem understanding how to retrieve the backend HTML
(i.e. what you would get if you clicked View Source in Internet Explorer) of
a webpage that is in a subdirectory of a website and is not a CGI or Perl
script.
For example, I want to read the source HTML of Yahoo! Top Stories to parse
it for a particular news string.
So, the domain is: news.yahoo.com
And the full URL is: http://news.yahoo.com/news?tmpl=index&cid=716
(which is a subdirectory of news.yahoo.com and an autogenerated/CGI-style
page, given the various query parameters)
The code I am using right now is:
Socket s = new Socket("news.yahoo.com", 80);
OutputStream os = s.getOutputStream();
PrintStream ps = new PrintStream(os);
ps.println("GET /news?tmpl=index&cid=716"); // where GET is a server command
InputStreamReader dr = new InputStreamReader(s.getInputStream());
BufferedReader br = new BufferedReader(dr);
String line = br.readLine();
System.out.println("line: " + line);
/* String Parsing Stuff Here */
Ok, so the problem with this is that the GET command works perfectly
fine for some websites and some CGI scripts, but it bombs on others. In this
case, Yahoo! is using an Apache backend, and I have not had an easy time
accessing their subdirectories and auto-generated pages (usually of the
form /q/ks?s= or that of the news page above).
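(One likely cause, sketched below as a guess rather than a confirmed diagnosis: a bare "GET /path" line is interpreted as an ancient HTTP/0.9 request, and servers hosting many sites behind one address also need a Host header to know which site you mean. A minimal reworking of the code above that sends a full HTTP/1.0 request, with a small helper method to build the request string:)

```java
import java.io.*;
import java.net.*;

public class FetchPage {
    // Build a complete HTTP/1.0 request. The bare "GET /path" form is
    // treated as HTTP/0.9, which many servers refuse or mishandle;
    // the Host header tells a multi-site server which site is wanted.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";                      // blank line ends the headers
    }

    public static void main(String[] args) throws IOException {
        Socket s = new Socket("news.yahoo.com", 80);
        OutputStream os = s.getOutputStream();
        os.write(buildRequest("news.yahoo.com",
                              "/news?tmpl=index&cid=716")
                 .getBytes("ISO-8859-1"));
        os.flush();

        BufferedReader br = new BufferedReader(
                new InputStreamReader(s.getInputStream()));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);       // response headers, then the HTML
            /* String Parsing Stuff Here */
        }
        s.close();
    }
}
```

Note that the response will start with a status line and headers before the HTML itself, so the parsing code should skip everything up to the first blank line.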
I have scrutinized the API, the tutorials, and several good Java books
in vain. Any guidelines, suggestions, or help would be welcome. Thanks in
advance for your help.
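(For what it's worth, the java.net.URL class in the standard API builds the request line and headers itself, so query strings like ?tmpl=index&cid=716 need no special handling and only the page body is returned. A minimal sketch, assuming the URL above is still reachable:)

```java
import java.io.*;
import java.net.*;

public class ViewSource {
    public static void main(String[] args) throws IOException {
        // URL.openStream() handles the HTTP request and response
        // headers internally and hands back just the page body.
        URL url = new URL("http://news.yahoo.com/news?tmpl=index&cid=716");
        BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openStream()));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);   // the page source, ready to parse
        }
        br.close();
    }
}
```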
S Pandya
(e-mail address removed)