Saunvit Pandya
I have a bit of a problem understanding how to retrieve the backend HTML
(i.e. what you would get if you clicked View Source in Internet Explorer) of
a webpage that is in a subdirectory of a website and is not a CGI or Perl
script.
For example, I want to read the source HTML of Yahoo! Top Stories to parse
it for a particular news string.
So, the domain is: news.yahoo.com
And the full URL is: http://news.yahoo.com/news?tmpl=index&cid=716
(which is a subdirectory of news.yahoo.com and an autogenerated/CGI-style
page, given the various query parameters)
The code I am using right now is:
Socket s = new Socket("news.yahoo.com", 80);
OutputStream os = s.getOutputStream();
PrintStream ps = new PrintStream(os);
ps.println("GET /news?tmpl=index&cid=716"); // where GET is a server command
InputStreamReader dr = new InputStreamReader(s.getInputStream());
BufferedReader br = new BufferedReader(dr);
String line = br.readLine();
System.out.println("line: " + line);
/* String Parsing Stuff Here */
Ok, so the problem with this is that the GET command works perfectly
fine for some websites and some CGI scripts, but it bombs on others. In this
case, Yahoo! is using an Apache backend, and I have not had an easy time
accessing their subdirectories and auto-generated pages (usually of the
form /q/ks?s= or that of the news page above).
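(One likely cause, sketched below as a guess rather than a confirmed diagnosis: a bare "GET /path" line is interpreted as an ancient HTTP/0.9 request, and servers hosting many sites behind one address also need a Host header to know which site you mean. A minimal reworking of the code above that sends a full HTTP/1.0 request, with a small helper method to build the request string:)

```java
import java.io.*;
import java.net.*;

public class FetchPage {
    // Build a complete HTTP/1.0 request. The bare "GET /path" form is
    // treated as HTTP/0.9, which many servers refuse or mishandle;
    // the Host header tells a multi-site server which site is wanted.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";                      // blank line ends the headers
    }

    public static void main(String[] args) throws IOException {
        Socket s = new Socket("news.yahoo.com", 80);
        OutputStream os = s.getOutputStream();
        os.write(buildRequest("news.yahoo.com",
                              "/news?tmpl=index&cid=716")
                 .getBytes("ISO-8859-1"));
        os.flush();

        BufferedReader br = new BufferedReader(
                new InputStreamReader(s.getInputStream()));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);       // response headers, then the HTML
            /* String Parsing Stuff Here */
        }
        s.close();
    }
}
```

Note that the response will start with a status line and headers before the HTML itself, so the parsing code should skip everything up to the first blank line.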
I have scrutinized the API, the tutorials, and several good Java books
in vain. Any guidelines, suggestions, or help would be welcome. Thanks in
advance for your help.
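(For what it's worth, the java.net.URL class in the standard API builds the request line and headers itself, so query strings like ?tmpl=index&cid=716 need no special handling and only the page body is returned. A minimal sketch, assuming the URL above is still reachable:)

```java
import java.io.*;
import java.net.*;

public class ViewSource {
    public static void main(String[] args) throws IOException {
        // URL.openStream() handles the HTTP request and response
        // headers internally and hands back just the page body.
        URL url = new URL("http://news.yahoo.com/news?tmpl=index&cid=716");
        BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openStream()));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);   // the page source, ready to parse
        }
        br.close();
    }
}
```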
S Pandya
(e-mail address removed)