Parsing HTML Source in Subdirectory/Auto-gen/CGI pages

Saunvit Pandya

I am having trouble understanding how to retrieve the HTML source
(i.e. what you would get if you clicked View Source in Internet Explorer) of
a webpage that is in a subdirectory of a website and is not a CGI or Perl
script.

For example, I want to read the HTML source of Yahoo Top Stories and parse
it for a particular news string.

So, the domain is: news.yahoo.com
And the full URL is: http://news.yahoo.com/news?tmpl=index&cid=716
(which is a subdirectory of news.yahoo.com and is an autogen/CGI page, judging
by the various descriptors)

The code I am using right now is:

Socket s = new Socket("news.yahoo.com", 80);
OutputStream os = s.getOutputStream();
PrintStream ps = new PrintStream(os);
ps.println("GET /news?tmpl=index&cid=716"); // where GET is a server command

InputStreamReader dr = new InputStreamReader(s.getInputStream());
BufferedReader br = new BufferedReader(dr);
String line = br.readLine();
System.out.println("line: " + line);

/* String Parsing Stuff Here */

OK, so the problem is that the GET command works perfectly fine for some
websites and for some CGI pages, but it bombs on others. In this case,
Yahoo! is using an Apache backend, and I have not had an easy time
accessing their subdirectories and auto-generated pages (usually of the
form /q/ks?s= or that of the news page above).
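
One possible cause, offered as an assumption rather than a diagnosis of
Yahoo's servers: a request line with no HTTP version and no Host header is
an HTTP/0.9-style request, and println() ends the line with the platform
line separator rather than CRLF. Servers that host several sites on one
address generally need the Host header to pick the right site. A minimal
sketch of a fuller request follows; the class and method names are
hypothetical and only illustrate the shape of the request text.

```java
import java.nio.charset.StandardCharsets;

class RawHttpRequest {
    // Builds an HTTP/1.0 GET request with an explicit Host header.
    // Every line ends in CRLF and a blank line terminates the headers.
    static String buildGetRequest(String host, String pathAndQuery) {
        return "GET " + pathAndQuery + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    public static void main(String[] args) {
        String request = buildGetRequest("news.yahoo.com", "/news?tmpl=index&cid=716");
        // Print the request text instead of sending it, so this runs offline.
        System.out.print(request);
        // To send it over the existing socket (s in the code above):
        //   OutputStream os = s.getOutputStream();
        //   os.write(request.getBytes(StandardCharsets.US_ASCII));
        //   os.flush();
    }
}
```

With a request like this, the response starts with a status line and
headers, so the reading side has to skip everything up to the first blank
line before it reaches the HTML.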

I have scrutinized the API, the tutorials, and several good Java books
in vain. Any guidelines, suggestions, or help will be welcomed. Thanks in
advance for your help.

S Pandya
(e-mail address removed)
 
Murray

> So, the domain is: news.yahoo.com
> And the full URL is: http://news.yahoo.com/news?tmpl=index&cid=716
> (which is a subdirectory of news.yahoo.com and is an autogen/CGI page,
> judging by the various descriptors)
>
> The code I am using right now is:
>
> Socket s = new Socket("news.yahoo.com", 80);
> OutputStream os = s.getOutputStream();
> PrintStream ps = new PrintStream(os);
> ps.println("GET /news?tmpl=index&cid=716"); // where GET is a server command

Are you familiar with HttpURLConnection? Seems like you're trying to roll
your own HTTP protocol when you don't really need to ...
 
Saunvit Pandya

Hey, thanks for your suggestion. :) I am looking into HttpURLConnection.
Do you have any tips on how I can use it to obtain an input stream?

Thanks
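
For reference, a minimal sketch of getting an input stream from
HttpURLConnection. The class name PageFetcher and its two helper methods
are hypothetical, UTF-8 is assumed as the page encoding, and the URL is the
one from the original post, which may no longer resolve.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    // Reads a stream to the end, line by line, and returns it as one string.
    // Assumption: the content is UTF-8 text.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Opens the URL, checks the status code, and returns the response body.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status: " + status);
            }
            // getInputStream() is the stream of the response body.
            return readAll(conn.getInputStream());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String html = fetch("http://news.yahoo.com/news?tmpl=index&cid=716");
        System.out.println("Fetched " + html.length() + " characters");
        /* String parsing goes here */
    }
}
```

The connection class builds the request line and the Host header itself, so
the query string in the URL needs no special handling.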
 
