How to slurp/get the content of a URI?

Discussion in 'Java' started by Stefan Ram, Jul 20, 2008.

  1. Stefan Ram

    Stefan Ram Guest

    I wonder what the best/canonical/Javaish way to get/slurp
    (i.e., read the whole content into a CharSequence) a URI is.
    In Perl, there is:

    use LWP::Simple; $content = get( "http://example.com/" );

    Say, one wanted to implement LWP::Simple::get in Java.

    What is the best way to do so?

    I currently do this as follows (omitting some details, like
    exceptions, encodings, and close()-operations):

    Connect via an HttpURLConnection object:

    final java.net.URL url = new java.net.URL( uri.toString() );

    final java.net.HttpURLConnection httpURLConnection
    =( java.net.HttpURLConnection )url.openConnection();

    httpURLConnection.connect();

    Then, filling a StringBuilder from it:

    final java.io.InputStreamReader inputStreamReader
    = new java.io.InputStreamReader
    ( httpURLConnection.getInputStream(), "UTF-8" );

    final java.io.BufferedReader bufferedReader
    = new java.io.BufferedReader( inputStreamReader );

    final java.lang.StringBuilder stringBuilder = new java.lang.StringBuilder();

    java.lang.String line;
    while(( line = bufferedReader.readLine() )!= null )
    { stringBuilder.append( line ); stringBuilder.append( '\n' ); }

    Is this the best/usual/canonical/Javaish way to do it,
    or should I use anything else?
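
    For comparison, here is a minimal sketch of the same idea with the
    close() operations included. It assumes Java 7 or later for
    try-with-resources; the class name Slurp and both method names are
    made up for this illustration. It reads in char chunks rather than
    with readLine(), so the original line terminators are kept as they
    arrive. The charset is still supplied by the caller, and UTF-8 in
    the convenience method is an assumption, not something detected.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Slurp {
    /** Reads everything from the stream, decoding it with the given charset. */
    static String slurp(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192];
        // Closing the reader also closes the underlying stream.
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }

    /** Convenience wrapper; UTF-8 is assumed here, not detected. */
    static String slurpUtf8(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return slurp(in, StandardCharsets.UTF_8);
        }
    }
}
```

    A call would look like Slurp.slurpUtf8( new java.net.URL( "http://example.com/" ) ).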
     
    Stefan Ram, Jul 20, 2008
    #1

  2. Stefan Ram

    Stefan Ram Guest

    -berlin.de (Stefan Ram) writes:
    >new java.io.InputStreamReader
    >( httpURLConnection.getInputStream(), "UTF-8" );


    A more specific question:

    Shouldn't I use the document encoding instead of »UTF-8«?

    But I will only know this after I have read the response!
    (Or, at least part of it.)

    So, should I adopt a two-pass read:
    Open with US-ASCII to get the document encoding,
    then open again with the document encoding?
     
    Stefan Ram, Jul 20, 2008
    #2

  3. Stefan Ram

    Mark Space Guest

    Stefan Ram wrote:
    > -berlin.de (Stefan Ram) writes:
    >> new java.io.InputStreamReader
    >> ( httpURLConnection.getInputStream(), "UTF-8" );

    >
    > A more specific question:
    >
    > Shouldn't I use the document encoding instead of »UTF-8«?


    The default for HTTP is "8859_1" (that's the Java charset name).
    There's a special protocol for negotiating a different charset, which
    you won't support because your get is too primitive.

    The server will either send you 8859.1 if it can, or it'll close the
    connection, I think.
     
    Mark Space, Jul 20, 2008
    #3
  4. Stefan Ram

    Mark Space Guest

    Mark Space wrote:
    > Stefan Ram wrote:
    >> -berlin.de (Stefan Ram) writes:
    >>> new java.io.InputStreamReader
    >>> ( httpURLConnection.getInputStream(), "UTF-8" );

    >>
    >> A more specific question:
    >>
    >> Shouldn't I use the document encoding instead of »UTF-8«?

    >
    > The default for HTTP is "8859_1" (that's the Java charset name). There's
    > a special protocol for negotiating a different charset, which you won't
    > support because your get is too primitive.
    >
    > The server will either send you 8859.1 if it can, or it'll close the
    > connection, I think.


    P.S. the openStream() method for URL seems to open the type of
    connection you need directly.

    BufferedReader bin = null;

    URL url = new URL( arg[0] );
    bin = new BufferedReader(
    new InputStreamReader( url.openStream() ));


    I think. Better check that. It's fewer lines though.
     
    Mark Space, Jul 20, 2008
    #4
  5. Stefan Ram

    Arne Vajhøj Guest

    Mark Space wrote:
    > Stefan Ram wrote:
    >> -berlin.de (Stefan Ram) writes:
    >>> new java.io.InputStreamReader
    >>> ( httpURLConnection.getInputStream(), "UTF-8" );

    >>
    >> A more specific question:
    >>
    >> Shouldn't I use the document encoding instead of »UTF-8«?

    >
    > The default for HTTP is "8859_1" (that's the Java charset name). There's
    > a special protocol for negotiating a different charset, which you won't
    > support because your get is too primitive.
    >
    > The server will either send you 8859.1 if it can, or it'll close the
    > connection, I think.


    What ?

    HttpURLConnection and its InputStream fetch bytes from the
    server. No negotiation is possible.

    When the client needs to interpret the bytes it needs to
    decide on an encoding.

    The code snippet above creates an InputStreamReader expecting
    UTF-8 encoding.

    If it is known that is the encoding then it is fine. If the encoding
    is unknown it should be based on HTTP header and HTML META tag info.

    There is no default ISO-8859-1 in either HTTP or Java. HTTP is
    always explicit and the Java default is system-specific.

    Arne
     
    Arne Vajhøj, Jul 20, 2008
    #5
  6. Stefan Ram

    Mark Space Guest

    Arne Vajhøj wrote:

    >
    > HttpURLConnection and its InputStream fetches bytes from the
    > server. No negotiations possible.


    I think that's what I'm saying. Although I'm no longer sure that
    HttpURLConnection doesn't fully support HTTP character sets. It might.


    > There is no default ISO-8859-1 in either HTTP or Java. HTTP is
    > always explicit and the Java default is system-specific.


    For a socket, yes, there is no default encoding. For HTTP, I think that
    is not true. 8859-1 is the default if nothing is specified, and it is
    legal to leave out the charset encoding -- in both the GET and the response.

    I think, anyway. I could be all wrong about that.

    Stefan has a valid question: If the content type isn't specified until
    you read the header, and you don't know the content type, how do you
    know what to open the stream as? The answer I think is that it's
    defined to be 8859-1 by default.

    Let me see if I can dig something up...

    Content Negotiation for HTTP:
    <http://en.wikipedia.org/wiki/Content_negotiation>

    Some info on "Missing Charset" in the RFC:
    <http://tools.ietf.org/html/rfc2616>
    Search for 8859.


    Back to Java: Also, URLConnection looks like it will allow one to read
    things like the content type and mime type before getting a Java
    InputStream to the content:

    URLConnection c = url.openConnection();
    String mimeType = c.getContentType();
    System.out.println( mimeType );

    And similarly for getContentEncoding();

    I gotta run. I hope I didn't booger things up too badly replying to
    Stefan. Apologies if I did.
     
    Mark Space, Jul 20, 2008
    #6
  7. Stefan Ram

    Arne Vajhøj Guest

    Mark Space wrote:
    > Arne Vajhøj wrote:
    >> There is no default ISO-8859-1 in either HTTP or Java. HTTP is
    >> always explicit and the Java default is system-specific.

    >
    > For a socket, yes, there is no default encoding. For HTTP, I think that
    > is not true. 8859-1 is the default if nothing is specified, and it is
    > legal to leave out the charset encoding -- in both the GET and the
    > response.


    > Let me see if I can dig something up...
    >
    > Content Negotiation for HTTP:
    > <http://en.wikipedia.org/wiki/Content_negotiation>
    >
    > Some info on "Missing Charset" in the RFC:
    > <http://tools.ietf.org/html/rfc2616>
    > Search for 8859.


    You are right. If nothing is specified it means ISO-8859-1. Which
    is rather bad since the world is moving from ISO-8859-1 to UTF-8.

    > Stefan has a valid question: If the content type isn't specified until
    > you read the header, and you don't know the content type, how do you
    > know what to open the stream as? The answer I think is that it's
    > defined to be 8859-1 by default.
    >
    > Back to Java: Also, URLConnection() looks like it will allow one to read
    > things like the content type and mime type before getting a Java
    > InputStream to the content:
    >
    > URLConnection c = url.openConnection();
    > String mimeType = c.getContentType();
    > System.out.println( mimeType );
    >
    > And similarly for getContentEncoding();


    Encoding in HTTP header is easy, because the headers are US-ASCII, so
    the client can read the headers and determine the encoding before
    reading the body.

    Encoding in HTML META tag is not so nice.

    Arne
     
    Arne Vajhøj, Jul 20, 2008
    #7
  8. Stefan Ram

    Mark Space Guest

    Arne Vajhøj wrote:

    >
    > Encoding in HTTP header is easy, because the headers are US-ASCII, so
    > the client can read the headers and determine the encoding before
    > reading the body.
    >
    > Encoding in HTML META tag is not so nice.


    Yes, HTML != HTTP. Sorry if the original question was about HTML
    instead of HTTP, I may be out in left field here.
     
    Mark Space, Jul 20, 2008
    #8
  9. Stefan Ram

    Mark Space Guest

    Stefan Ram wrote:

    > Shouldn't I use the document encoding instead of »UTF-8«?
    >
    > But I will only know this after I have read the response!
    > (Or, at least part of it.)


    So I'm no expert, and I hope I'm not wasting your time by blathering,
    but the question is interesting to me so I did a bit of work on it.
    Here's what I have so far.


    static void method4() throws MalformedURLException, IOException {
        String TEST_URL = "http://cnn.com";
        URL url = new URL(TEST_URL);
        URLConnection c = url.openConnection();
        String type = c.getContentType();
        System.out.println("Mime type: " + type );
        if( type == null || type.contains("text") )
        {
            String enc = c.getContentEncoding();
            System.out.println( "Encoding: " + enc );
            if( enc == null )
            {
                enc = "ISO-8859-1";
            }
            // I have no idea if http encoding strings will work here
            InputStreamReader inr = new InputStreamReader(
                    c.getInputStream(), enc );
            List<CharBuffer> result = new ArrayList<CharBuffer>();
            int byteCount = 0;
            for( ;; )
            {
                int read;
                CharBuffer cb = CharBuffer.allocate( 4 * 1024 );
                if( ( read = inr.read( cb )) != -1 )
                {
                    byteCount += read;
                    result.add( cb );
                }
                else
                {
                    break;
                }
            }
            System.out.println( "Read: " + byteCount );
        }
        else // binary
        {
            System.out.println("binary...");
        }
    }

    Some other thoughts:

    1. If the URL string depends on user input, you may have to use
    URLEncoder if the user input goes in the parameter part of the URL.

    2. Don't forget that other protocols besides HTTP exist. The Java API
    also supports FTP and JAR I believe. You might get one of those instead
    of HTTP. You may wish to check the protocol expressly if you don't set
    it yourself.

    3. Both mime type and the character encoding may be null. The defaults
    are "text" and ISO-8859-1 respectively, but there are also "guess"
    methods in the URLConnection object.

    4. If you don't have text, you might have an image. It might be nice to
    return an Image in that case. I didn't get that far though.

    5. I can't find any expandable buffers for Java. StringBuilder or
    StringWriter seem like a good idea. I made my own by stuffing
    CharBuffers into a List. The idea is to avoid testing each character
    for an end-of-line, which readLine() must do. Hopefully the CharBuffer
    is faster.

    6. You could also read the data raw (ByteBuffer) and decide what to do
    with it later. This might be more in the spirit of a "slurp" operation.

    7. I looked for a way to get a channel from the URLConnection and didn't
    find one. I think this is a defect in the Java API, myself. Using
    direct buffers might be a big performance win here. You'll need a raw
    socket for that I guess.
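
    As a rough illustration of points 5 and 6, the sketch below reads
    the raw bytes into a java.io.ByteArrayOutputStream, which grows as
    needed, and leaves the decoding decision for later. The class name
    RawSlurp and the chunk size are arbitrary choices for this example,
    and it has not been measured against the CharBuffer list above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class RawSlurp {
    /** Reads the stream to its end and returns the bytes undecoded. */
    static byte[] readAllBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 at end of stream; the output buffer expands itself.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

    The caller can then build a String with whatever charset turns out
    to be right, for example new String( bytes, "ISO-8859-1" ).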
     
    Mark Space, Jul 20, 2008
    #9
  10. Stefan Ram

    Stefan Ram Guest

    Mark Space <> writes:
    >String enc = c.getContentEncoding();
    >System.out.println( "Encoding: " + enc );
    >if( enc == null )
    >{
    >enc = "ISO-8859-1";


    In spite of its name, getContentEncoding() does /not/
    designate the content character encoding.

    It designates the HTTP »content-encoding« header.
    This designates the HTTP compression method used.
    For example, it might be »gzip«. See

    http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11
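
    To make the distinction concrete, here is a small sketch of what the
    content-encoding value is actually good for: deciding whether the
    byte stream has to be decompressed before any character decoding
    happens. The class name ContentEncodings and the method unwrap are
    invented for this example. Only gzip (and its alias x-gzip from
    RFC 2616) is handled; any other value is passed through untouched,
    which is a simplification.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

final class ContentEncodings {
    /**
     * Wraps the raw body stream according to the Content-Encoding header value.
     * A null or unrecognized value returns the stream unchanged.
     */
    static InputStream unwrap(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null) {
            String value = contentEncoding.trim();
            if (value.equalsIgnoreCase("gzip") || value.equalsIgnoreCase("x-gzip")) {
                return new GZIPInputStream(raw);
            }
        }
        return raw;
    }
}
```

    With a URLConnection c this would be used as
    ContentEncodings.unwrap( c.getInputStream(), c.getContentEncoding() ),
    and the charset question is then settled separately from the
    content-type header.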
     
    Stefan Ram, Jul 20, 2008
    #10
  11. Stefan Ram

    Tom Anderson Guest

    On Sat, 19 Jul 2008, Mark Space wrote:

    > Mark Space wrote:
    >> Stefan Ram wrote:
    >>> -berlin.de (Stefan Ram) writes:
    >>>> new java.io.InputStreamReader
    >>>> ( httpURLConnection.getInputStream(), "UTF-8" );
    >>>
    >>> A more specific question:
    >>>
    >>> Shouldn't I use the document encoding instead of »UTF-8«?

    >>
    >> The default for HTTP is "8859_1" (that's the Java charset name).
    >> There's a special protocol for negotiating a different charset, which
    >> you won't support because your get is too primitive.
    >>
    >> The server will either send you 8859.1 if it can, or it'll close the
    >> connection, I think.


    My understanding is that the server may, in pretty much any situation,
    send whatever charset it likes, as long as it declares it in the
    content-type header.

    > P.S. the openStream() method for URL seems to open the type of connection
    > you need directly.
    >
    > BufferedReader bin = null;
    >
    > URL url = new URL( arg[0] );
    > bin = new BufferedReader(
    > new InputStreamReader( url.openStream() ));
    >
    > I think. Better check that.


    You're absolutely right.

    A slightly more correct approach (which might have been expounded
    downthread already) would be to use a URLConnection, get the content-type,
    parse it to identify a charset, and then use that to configure the
    InputStreamReader correctly.

    Sadly, and shockingly, there doesn't seem to be anything to parse
    content-type headers in the standard library. There is a
    javax.mail.internet.ContentType in J2EE, though, and it's not too hard to
    write yourself.
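
    A hand-written version might look like the sketch below. The class
    name ContentTypeCharset and the method charsetOf are hypothetical.
    It splits the header on semicolons, which ignores the corner case
    of a semicolon inside a quoted parameter value, and it hands the
    result to Charset.forName, returning the caller's fallback when the
    name is missing, malformed, or unknown to the JVM.

```java
import java.nio.charset.Charset;

final class ContentTypeCharset {
    /**
     * Returns the charset named in a Content-Type header value such as
     * "text/html; charset=UTF-8", or the fallback if none can be used.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim();
            if (!name.equalsIgnoreCase("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional double quotes around the value.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Charset.forName(value);
            } catch (IllegalArgumentException e) {
                // Covers both illegal and unsupported charset names.
                return fallback;
            }
        }
        return fallback;
    }
}
```

    The result can be passed straight to the InputStreamReader
    constructor that takes a Charset, with ISO-8859-1 as the fallback
    if one follows the RFC 2616 default discussed upthread.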

    There's also an intriguing getContent() method that sounds like it should
    be even closer to what Stefan wanted - it downloads the bytes, then uses
    the content-type to convert them into an object. However, it's not
    entirely clear exactly what kind of object you're supposed to get, which
    makes it more or less useless. In practice, getting HTML text gives you an
    InputStream, and getting an image gives you a
    java.awt.image.ImageProducer. That's not enormously useful here.

    tom

    --
    Sometimes it takes a madman like Iggy Pop before you can SEE the logic
    really working.
     
    Tom Anderson, Jul 22, 2008
    #11
  12. Stefan Ram

    Stefan Ram Guest

    Tom Anderson <> writes:
    >Sometimes it takes a madman like Iggy Pop before you can SEE


    I am wondering whether I should attend his concert
    with the Stooges at the end of the next month.

    Regarding the charset parameter of MIME types, there is:

    java.awt.datatransfer.MimeTypeParameterList

    But it is not a public class. So much for reuse.
     
    Stefan Ram, Jul 22, 2008
    #12
  13. Stefan Ram

    Mark Space Guest

    Stefan Ram wrote:
    > Mark Space <> writes:
    >> String enc = c.getContentEncoding();
    >> System.out.println( "Encoding: " + enc );
    >> if( enc == null )
    >> {
    >> enc = "ISO-8859-1";

    >
    > In spite of its name, getContentEncoding() does /not/
    > designate the content character encoding.


    Yup, I shoulda read the docs better. I'll correct my example, thanks.
     
    Mark Space, Jul 22, 2008
    #13
  14. Stefan Ram

    Arne Vajhøj Guest

    Mark Space wrote:
    > So I'm no expert, and I hope I'm not wasting your time by blathering,
    > but the question is interesting to me so I did a bit of work on it.
    > Here's what I have so far.
    >
    > static void method4() throws MalformedURLException, IOException {
    > String TEST_URL =
    > "http://cnn.com";
    > URL url = new URL(TEST_URL);
    > URLConnection c = url.openConnection();
    > String type = c.getContentType();
    > System.out.println("Mime type: " + type );
    > if( type == null || type.contains("text") )
    > {
    > String enc = c.getContentEncoding();
    > System.out.println( "Encoding: " + enc );
    > if( enc == null )
    > {
    > enc = "ISO-8859-1";
    > }
    > InputStreamReader inr = new InputStreamReader(
    > c.getInputStream(),
    > enc ); // I have no idea if http encoding strings will work here
    > List<CharBuffer> result = new ArrayList<CharBuffer>();
    > int byteCount = 0;
    > for( ;; )
    > {
    > int read;
    > CharBuffer cb = CharBuffer.allocate( 4 * 1024 );
    > if( ( read = inr.read( cb )) != -1 )
    > {
    > byteCount += read;
    > result.add( cb );
    > }
    > else
    > {
    > break;
    > }
    > }
    > System.out.println( "Read: " + byteCount );
    > }
    > else // binary
    > {
    > System.out.println("binary...");
    > }
    > }


    You need to also handle the META HTTP-EQUIV way of specifying charset.

    My suggestion for code:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HttpDownloadCharset {
        private static Pattern encpat =
            Pattern.compile("charset=([A-Za-z0-9-]+)", Pattern.CASE_INSENSITIVE);

        // Extracts the charset from a Content-Type value; falls back to
        // ISO-8859-1 when the value is missing or names no charset.
        private static String parseContentType(String contenttype) {
            if(contenttype == null) {
                return "ISO-8859-1";
            }
            Matcher m = encpat.matcher(contenttype);
            if(m.find()) {
                return m.group(1);
            } else {
                return "ISO-8859-1";
            }
        }

        private static Pattern metaencpat = Pattern.compile(
            "<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']>",
            Pattern.CASE_INSENSITIVE);

        // Looks for a META HTTP-EQUIV Content-Type tag in the HTML;
        // returns defenc when no such tag is found.
        private static String parseMetaContentType(String html, String defenc) {
            Matcher m = metaencpat.matcher(html);
            if(m.find()) {
                return parseContentType(m.group(1));
            } else {
                return defenc;
            }
        }

        private static final int DEFAULT_BUFSIZ = 1000000;

        public static String download(String urlstr) throws IOException {
            URL url = new URL(urlstr);
            HttpURLConnection con = (HttpURLConnection)url.openConnection();
            con.connect();
            if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
                String enc = parseContentType(con.getContentType());
                int bufsiz = con.getContentLength();
                if(bufsiz < 0) {
                    bufsiz = DEFAULT_BUFSIZ;
                }
                byte[] buf = new byte[bufsiz];
                InputStream is = con.getInputStream();
                int ix = 0;
                int n;
                // Stops at end of stream or when the buffer is full.
                while((n = is.read(buf, ix, buf.length - ix)) > 0) {
                    ix += n;
                }
                is.close();
                con.disconnect();
                // Decode only the ix bytes actually read, not the whole buffer.
                String temp = new String(buf, 0, ix, "US-ASCII");
                enc = parseMetaContentType(temp, enc);
                return new String(buf, 0, ix, enc);
            } else {
                String message = con.getResponseMessage();
                con.disconnect();
                throw new IllegalArgumentException("URL " + urlstr + " returned " + message);
            }
        }
    }

    Arne
     
    Arne Vajhøj, Jul 27, 2008
    #14