Getting text from a URL

mic123

I am trying to read the text of a website using a URL object and a data
stream.
It works well on CNN.com, for example, but not on:
http://www.collegehumor.com:80/video:1674301

How should I interpret the stream I'm getting?


I'm using the following code:

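URL u;
InputStream is = null;
DataInputStream dis;
String s;

try {
    u = new URL("http://www.collegehumor.com:80/video:1674301");
    is = u.openStream(); // throws an IOException
    dis = new DataInputStream(new BufferedInputStream(is));
    // DataInputStream.readLine() is deprecated, but it still returns lines
    while ((s = dis.readLine()) != null) {
        System.out.println(s);
    }
} catch (MalformedURLException mue) {
    mue.printStackTrace(); // report a bad URL instead of swallowing it
} catch (IOException ioe) {
    ioe.printStackTrace(); // report I/O failures instead of swallowing them
} finally {
    try {
        if (is != null) { // openStream() may have failed, leaving is null
            is.close();
        }
    } catch (IOException ioe) {
        // nothing useful to do if close() fails
    }
} // end of 'finally' clause

} // end of main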
 
mic123

Régis Décamps said:
What makes you think it does not work?
The fact that instead of normal HTML text I'm getting gibberish like this:
<?s?6²¿w¦???E?9¿$J´-e?I|/N|¶?^s???$$1¦??l«???· ?IQ²?v??¼d?X`???~8?tr????e??\~~????hm]??>????S??÷7??1?MB?4?B?H×?>jD?e??@×???;÷v?'S??J@X&vV??¬?³d?6??»#|¿x?h
¯?,£?¶?o??n¨??8cq?¾Y-?F|y7?2??????3??,?)o=·m
?RL?l¨?e6?I©7
As HTML?

I don't get exactly what you want to do, but have you considered
Jakarta HttpClient?
Thanks for the tip - will give it a shot
 
Arne Vajhøj

The fact that instead of normal HTML text I'm getting gibberish like this:
<?s?6²¿w¦???E?9¿$J´-e?I|/N|¶?^s???$$1¦??l«???· ?IQ²?v??¼d?X`???~8?tr????e??\~~????hm]??>????S??÷7??1?MB?4?B?H×?>jD?e??@×???;÷v?'S??J@X&vV??¬?³d?6??»#|¿x?h
¯?,£?¶?o??n¨??8cq?¾Y-?F|y7?2??????3??,?)o=·m
?RL?l¨?e6?I©7

It looks as if that URL is returning its content gzip-compressed.

Try wrapping the InputStream in a GZIPInputStream.
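
A minimal sketch of that suggestion, not code from the thread: the class name GzipBody and its methods are hypothetical, and UTF-8 is assumed for the page text. It wraps the stream in a GZIPInputStream only when the Content-Encoding value says "gzip", so uncompressed sites such as the CNN.com case would still read normally. The demo in main compresses a string locally so that it runs without network access.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only.
final class GzipBody {

    // Wraps the raw stream in a GZIPInputStream only when the declared
    // content encoding is "gzip"; otherwise returns the stream unchanged.
    static InputStream unwrap(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Reads the whole (possibly compressed) body as text, one line at a time.
    // Assumption: the page is UTF-8; a real page may declare another charset.
    static String readText(InputStream raw, String contentEncoding) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                unwrap(raw, contentEncoding), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Compresses a string; exists only so the demo needs no network.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("<html>hello</html>");
        System.out.print(readText(new ByteArrayInputStream(body), "gzip"));
    }
}
```

Against a real URL, the two arguments would presumably come from a URLConnection obtained with u.openConnection(): its getInputStream() for the stream and its getContentEncoding() for the encoding.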

Arne
 
Andrew Thompson

I am trying to read the text of a website using a URL object and a data
stream.
It works well on CNN.com, for example, but not on:
http://www.collegehumor.com:80/video:1674301

This source loads and displays (crudely) the web page
at that address.

<sscce>
import javax.swing.*;
import java.net.URL;

public class ShowURL {
    public static void main(String[] args) {
        String address = null;
        if (args.length==0) {
            address = JOptionPane.showInputDialog(null, "URL?");
        } else {
            address = args[0];
        }
        JEditorPane jep = null;
        try {
            URL url = new URL(address);
            jep = new JEditorPane(url);
        } catch(Exception e) {
            jep = new JEditorPane();
            jep.setText( e.toString() );
        }
        JScrollPane jsp = new JScrollPane(jep);
        jsp.setPreferredSize(new java.awt.Dimension(400,300));
        JOptionPane.showMessageDialog(null, jsp);
    }
}
</sscce>

...so the data is readable, and it is a web-page.

Andrew T.
 
William Brogden

Régis Décamps said:
What makes you think it does not work?
The fact that instead of normal HTML text I'm getting gibberish like this:
<?s?6²¿w¦???E?9¿$J´-e?I|/N|¶?^s???$$1¦??l«???·?IQ²?v??¼d?X`???~8?tr????e??\~~????hm]??>????S??÷7??1?MB?4?B?H×?>jD?e??@×???;÷v?'S??J@X&vV??¬?³d?6??»#|¿x?h
¯?,£?¶?o??n¨??8cq?¾Y-?F|y7?2??????3??,?)o=·m
?RL?l¨?e6?I©7

As another poster already said, this is gzip encoded.

When I do this sort of thing I just grab the data stream to a byte[] -
then take a look at the headers to see what the encoding is when I have
the whole message.

I found that it is necessary to search for the GZIP signature bytes
to locate the start of the gzip stream after the headers.
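
The signature search described above might look roughly like the sketch below. This is an illustration under stated assumptions, not Bill's code: the class GzipScan and its methods are hypothetical, the whole raw message is assumed to already be in a byte[], and the body is assumed not to use chunked transfer encoding. A gzip stream begins with the two bytes 0x1f 0x8b, so the sketch scans for that pair and decompresses from there. Since the same pair could in principle appear earlier by chance, locating the blank line that ends the headers would be a more robust way to find the body.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only.
final class GzipScan {

    // Returns the index of the first gzip signature (0x1f 0x8b), or -1 if absent.
    static int findGzipStart(byte[] data) {
        for (int i = 0; i + 1 < data.length; i++) {
            if ((data[i] & 0xff) == 0x1f && (data[i + 1] & 0xff) == 0x8b) {
                return i;
            }
        }
        return -1;
    }

    // Decompresses the gzip stream that starts at the given offset.
    static byte[] gunzipFrom(byte[] data, int offset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    // Compresses a string; exists only so the demo needs no network.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream g = new GZIPOutputStream(bos)) {
            g.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a fake raw response: ASCII headers followed by a gzip body.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        raw.write("HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n"
                .getBytes(StandardCharsets.US_ASCII));
        raw.write(gzip("<html>hello</html>"));
        byte[] message = raw.toByteArray();

        int start = findGzipStart(message);
        System.out.println(new String(gunzipFrom(message, start), StandardCharsets.UTF_8));
    }
}
```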

Bill
 
