Html download challenge

Discussion in 'Java' started by Paul Battersby, Jun 30, 2005.

  1. I've spent days poking around the internet, reading help information, trying
    to find working source code but no luck so far.

    My problem, on the surface and to someone who knows what he/she is doing,
    should be easy to solve.

    All I want to do is download the HTML from the following url:

    http://www.google.com/search?q=business

    Sounds simple. I can type that into a browser and I will get a page full of
    information. I try to download that using a Java program, and the server
    seems to know I am not a browser (my code works with other URLs just fine).
    I figure I need to pass some sort of header information or something so that
    I appear to be a browser.

    So, what I'm looking for, if anyone is up to the challenge, is a small piece
    of Java source code that is capable of downloading the HTML from the above
    mentioned url and printing it to the screen.

    On my own, I think I'm looking at a pretty big learning curve (low level
    HTTP protocol) to sort this out.

    Any help is of course greatly appreciated.
     
    Paul Battersby, Jun 30, 2005
    #1

  2. You are probably doing something else wrong. I just tried with telnet,
    and it works.
    Just to be sure, do the following:

    telnet www.google.com 80

    GET http://www.google.com/search?q=business HTTP/1.1
    <return a second time>

    You should get the page.


     
    Andrea Desole, Jun 30, 2005
    #2

  3. Thanks but I don't really understand. I need to be able to do this within a
    Java program. I think what you've suggested is using a telnet application.
     
    Paul Battersby, Jun 30, 2005
    #3
  4. There is an alternative approach. Google has a Java API. See
    http://www.google.com/apis/.

    The licensing limits you to 1000 queries per day, and specifies
    personal, non-commercial use only. I assume they are trying to prevent
    anyone from constructing a rival search site with their own ads but
    Google's search results.

    As long as you meet the licensing restrictions, it is MUCH easier to
    access the results using their API than by trying to parse their web
    pages, even if you can get hold of them.

    Patricia
     
    Patricia Shanahan, Jun 30, 2005
    #4
  5. > Google has a Java API

    Thanks, I'll keep it in mind. That might be what I end up doing but I would
    like to avoid any limitations it imposes.

    > MUCH easier to access the results using their API than by trying to parse
    > their web pages

    Parsing their web page, I can already do. That part was easy. Downloading
    the HTML outside of a browser I can't do. I know that what I want to do is
    possible. HotJava does it (a browser written in Java). Unfortunately the
    source code is not available so I can't see how it is done.
     
    Paul Battersby, Jun 30, 2005
    #5
  6. Paul Battersby wrote:
    > Thanks but I don't really understand. I need to be able to do this within a
    > java program. I think what you've suggested is using a telnet application.


    Not really. What I'm saying is that, since it works using telnet, it
    also has to work with a regular Java URL connection without adding any
    extra information. In my telnet session I only used a GET request; no
    special headers.
    At the beginning I thought that Google might check something like the
    User-Agent header, but now I find it hard to believe it doesn't work.
    Maybe I'll try later on with a Java application.
     
    Andrea Desole, Jun 30, 2005
    #6
  7. That would be my guess, yes. The secret that I have been unable to uncover
    is what properties I need to set to make it work.

    "Tor Iver Wilhelmsen" <> wrote in message
    news:...
    > "Paul Battersby" <> writes:
    >
    > > I figure I need to pass some sort of header information or something so that
    > > I appear to be a browser.
    >
    > So you are looking for URLConnection.setRequestProperty("User-Agent",
    > "some browser string").
     
    Paul Battersby, Jun 30, 2005
    #7
  8. "Paul Battersby" <> writes:

    > I figure I need to pass some sort of header information or something so that
    > I appear to be a browser.


    So you are looking for URLConnection.setRequestProperty("User-Agent",
    "some browser string").
     
    Tor Iver Wilhelmsen, Jun 30, 2005
    #8
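For reference, here is a minimal sketch of what that suggestion might look like as a complete program. The class name UserAgentFetch and the agent string "Test/1.0" are made up for illustration, and decoding the page as ISO-8859-1 is an assumption; none of this is claimed to be what a real browser sends.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class UserAgentFetch {

    // Hypothetical agent string, invented for this example.
    static final String AGENT = "Test/1.0";

    /** Prepares (but does not yet send) a request carrying our own User-Agent. */
    static HttpURLConnection open(String webAddr) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(webAddr).toURL().openConnection();
        conn.setRequestProperty("User-Agent", AGENT);
        return conn;
    }

    /** Sends the request and returns the body as one string. */
    static String loadPage(String webAddr) throws IOException {
        HttpURLConnection conn = open(webAddr);
        StringBuilder html = new StringBuilder();
        // Assumption: the page can be decoded as ISO-8859-1.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                conn.getInputStream(), StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(loadPage("http://www.google.com/search?q=business"));
    }
}
```

The header is set per connection, so it only affects requests made through open(); other code in the same JVM keeps the default agent.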
  9. "Andrea Desole" <> wrote in message
    news:d3c7b$42c3fe4d$d468cb3c$-service.com...

    > I find it hard to believe it doesn't work.
    > Maybe I'll try later on with a Java application


    Here is some sample code. You will see me loading from 2 urls. The first one
    works. The second (the one I care about) does not even though my browser
    (Internet Explorer) has no trouble with the url.

    I get this error:

    Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.google.com/search?q=business
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
        at java.net.URL.openStream(Unknown Source)
        at SimpleTest.loadPage(SimpleTest.java:14)
        at SimpleTest.main(SimpleTest.java:30)

    If you can see something simple that I'm doing wrong, that would be great.

    // -------------------- program begins

    import java.net.*;
    import java.io.*;

    public class SimpleTest {

        public String loadPage(String webAddr) throws Exception {
            String inputLine;
            BufferedReader in;
            URL targetUrl;

            targetUrl = new URL(webAddr);
            StringBuffer htmlString = new StringBuffer();

            in = new BufferedReader(new InputStreamReader(
                    targetUrl.openStream()));

            while ((inputLine = in.readLine()) != null) {
                htmlString.append(inputLine);
            }

            return htmlString.toString();
        }

        static public void main(String[] args) throws Exception {
            SimpleTest test = new SimpleTest();

            System.out.println("----------------------------------------------");
            System.out.println(test.loadPage("http://google.com"));
            System.out.println("----------------------------------------------");

            System.out.println(test.loadPage("http://google.com/search?q=business"));
        }
    }

    // ------------ program ends
     
    Paul Battersby, Jun 30, 2005
    #9
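A side note on the 403: openStream() throws before you get to see what the server actually said. Below is a hedged sketch of one way to look at the reply instead, using HttpURLConnection.getResponseCode() and getErrorStream(). So that it runs without touching Google, it talks to a throwaway local server (the JDK's com.sun.net.httpserver) that always answers 403. The class name ErrorBodyDemo, the message text and the local server are all assumptions made for the demo.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class ErrorBodyDemo {

    /** Returns "<status>: <body>", reading the error stream for 4xx/5xx replies. */
    static String fetch(String webAddr) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(webAddr).toURL().openConnection();
        int status = conn.getResponseCode();   // returns 403 rather than throwing
        InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (in == null) {
            return status + ": ";              // server sent no body at all
        }
        try (InputStream body = in) {
            return status + ": "
                    + new String(body.readAllBytes(), StandardCharsets.ISO_8859_1);
        }
    }

    /** Runs fetch() against a local stand-in server that always answers 403. */
    static String demo() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            byte[] msg = "client not allowed".getBytes(StandardCharsets.ISO_8859_1);
            exchange.sendResponseHeaders(403, msg.length);
            exchange.getResponseBody().write(msg);
            exchange.close();
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            return fetch("http://127.0.0.1:" + port + "/search?q=business");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());   // prints "403: client not allowed"
    }
}
```

Pointing fetch() at a real URL instead of the local server would show whatever explanation that server sends along with its error status.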
  10. On Thu, 30 Jun 2005 09:26:38 -0400, "Paul Battersby"
    <> wrote or quoted :

    >Sounds simple. I can type that into a browser and I will get a page full of
    >information. I try to download that using a Java program, and the server
    >seems to know I am not a browser (my code works with other Urls just fine).
    >I figure I need to pass some sort of header information or something so that
    >I appear to be a browser.


    Use http://mindprod.com/jgloss/ethereal.com to see what sort of get
    packets the browser is composing. Just pad your header with those crap
    fields and they will have no way of knowing you are not a browser.

    I did some screenscraping and I eventually got caught. I was just too
    regular. I did not even know they would object.

    Why do people publicly post information if they don't want you to make
    use of it?

    See http://mindprod.com/jgloss/urlconnection for how to write such
    code. This is very old code that uses low-level sockets.


    Here is a little one-shot scraper I wrote that I no longer use.

    /* Oanda complained, so I had to discontinue this.
    * Requests a currency conversion from Oanda.com
    * I pretend to be a browser, get the page and extract the part I
    * want.
    * copyright (c) 1998-2005 Roedy Green, Canadian Mind Products
    * #327 - 964 Heywood Avenue
    * Victoria, BC Canada V8V 2Y5
    * tel: (250) 361-9093
    * mailto:
    * http://mindprod.com
    */
    package com.mindprod.currcon;
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;

    import com.mindprod.voter.CGIRequest;
    import com.mindprod.voter.CGIget;

    public class Oanda
    {


    private static final String EmbeddedCopyright =
    "copyright (c) 2003-2005 Roedy Green, Canadian Mind Products, http://mindprod.com";

    // c o n f i g u r a t i o n s t r i n g s


    /**
    * Site URL to process the cgi script. without http:// on front
    */
    final static String host = "www.oanda.com";

    /**
    * Name of CGI Script to process this vote, namely the ACTION
    * parameter, without host.
    * absolute name on host.
    */
    final static String relativeURL = "/convert/fxdaily";

    /**
    * get list of currencies to fetch, and glue them together
    * separated with underscores.
    *
    * @return String of currency codes wanted, separated
    * by underscores. e.g. CAD_USD_EUR
    */
    public static String getWantedCurrencies () throws IOException
    {
    BufferedReader r = new BufferedReader ( new FileReader (
    "oanda.wanted" ), 4096 );
    StringBuffer sb = new StringBuffer ( 700 );
    String line;
    while ( ( line = r.readLine () ) != null )
    {
    sb.append ( "_" );
    sb.append ( line );
    }
    return sb.toString().substring( 1 );
    }


    /**
    * extract the useful CSV info out of the web page.
    *
    * @param haystack the entire webpage
    * @return Extracted goodies, just the CSV data.
    */
    static String extractGoodies ( String haystack )
    {
    /* The result that comes back is embedded in a large web page.
    We find it by
    <PRE><font face=Verdana size=2>Currency,Code,USD/1 Unit,Units/1 USD

    Canadian Dollar,CAD,0.7283,1.3737
    Swiss Franc,CHF,0.7739,1.2933
    British Pound,GBP,1.6325,0.6127
    Japanese Yen,JPY,0.008562,116.9
    </TD></TR></font></PRE>
    */
    String lookFor = "<PRE><font face=Verdana size=2>Currency,Code,USD/1 Unit,Units/1 USD";
    int startGoodies = haystack.indexOf( lookFor );
    if ( startGoodies < 0 )
    {
    System.out.println("failure. Oanda format change.");
    System.exit(1);
    return null;
    }
    else
    {
    // bypass junk on front
    haystack = haystack.substring( startGoodies +
    lookFor.length() + 2 );
    int endGoodies = haystack.indexOf( "</TD>" );
    if ( endGoodies < 0 )
    {
    System.out.println("failure. Oanda format change.");
    System.exit(1);
    return null;
    }
    return haystack.substring( 0, endGoodies );
    }
    }

    /**
    * Save the results in the oanda.csv file.
    *
    * @param result Results to save, in csv format.
    *
    * @exception IOException
    */
    static void save ( String result ) throws IOException
    {
    // save result in oanda.csv ready for further processing.
    FileWriter w = new FileWriter ( "oanda.csv" );
    w.write( result );
    w.close();
    }
    /**
    * Connect and send request
    */
    public static void main (String[] args)
    {
    try
    {
    String currencies = getWantedCurrencies();

    // prepare http parms to server and get return data
    CGIRequest p = new CGIRequest( 2000 );
    // order appears to matter.
    p.appendCGIPair( "value", "1" );
    // leave out date to get today's date.
    p.appendCGIPair( "date_fmt", "jp" );
    p.appendCGIPair( "redirected", "1" );
    p.appendCGIPair( "result", "1" );
    p.appendCGIPair( "lang", "en" );
    p.appendCGIPair( "exch", "USD" );
    p.appendCGIPair( "exch2", "" );
    p.appendCGIPair( "expr2", "" );
    p.appendCGIPair( "format", "CSV" );
    p.appendCGIPair( "dest", "Get Table" );
    p.appendCGIPair( "sel_list", currencies );
    String parms = p.toString();

    System.out.println( "sending request to oanda.com. Please be patient." );

    // ask oanda for exchange rates on given list of currencies.
    String haystack = CGIget.get( host, relativeURL, parms );

    /* extract good stuff from the whole web page */
    String result = extractGoodies ( haystack );

    System.out.println( result );

    save ( result );
    }
    catch ( IOException e )
    {
    System.out.println(e);
    }
    } // end main
    } // end class Oanda


    // Class com/mindprod/voter/CGIget.java
    // copyright (c) 1998-2005 Roedy Green, Canadian Mind Products
    // based on work by Jonathan Revusky

    // To encode strings use java.net.URLEncoder.encode;
    // and java.net.URLDecoder.decode or CGIRequest.
    /*
    * copyright (c) 1998-2005
    * Roedy Green
    * Canadian Mind Products
    * #327 - 964 Heywood Avenue
    * Victoria, BC Canada V8V 2Y5
    * tel: (250) 361-9093
    * mailto:
    * http://mindprod.com
    */
    // Version 1.0

    package com.mindprod.voter;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.Socket;
    import java.net.URL;

    /**
    * simulates a browser posting a form to CGI via GET.
    */
    public class CGIget
    {
    /**
    * Static only. Prevent instantiation.
    */
    private CGIget()
    {
    }
    /**
    * Send a formful of data to the CGI host using GET.
    *
    * @param websiteURL URL of the website
    * @param relativeURL
    * relative URL of the document/CGI desired
    * Absolute begin with /.
    *
    * @param parms parms to send, encoded with URLEncoder
    *
    * @return CGI host's response with headers and embedded length
    * fields stripped
    * @exception IOException
    */
    public static String get( String websiteURL, String relativeURL,
    String parms ) throws IOException
    {
    // O P E N

    // relativeURL already begins with '/', so do not add another one.
    URL url = new URL ( "http://" + websiteURL + relativeURL +
    '?' + parms );
    HttpURLConnection urlc =
    (HttpURLConnection)url.openConnection();
    urlc.setAllowUserInteraction( false );
    urlc.setDoInput( true );
    urlc.setDoOutput( false );
    urlc.setUseCaches( false );
    urlc.setRequestMethod( "GET" );
    urlc.connect();
    InputStream is = urlc.getInputStream();

    int statusCode;
    statusCode = urlc.getResponseCode();
    // get size of message. -1 means comes in an indeterminate
    // number of chunks.
    int estimatedLength = (int)urlc.getContentLength();
    if ( estimatedLength < 0 )
    {
    estimatedLength = 32*1024;
    }

    // R E A D
    String result = readEverything( is, estimatedLength );
    // C L O S E
    is.close();
    urlc.disconnect();

    return result;
    } // end get



    /**
    * Send a formful of data to the CGI host using GET.
    *
    * @param websiteURL URL of the website
    * @param relativeURL
    * relative URL of the document/CGI desired
    * Absolute begin with /.
    * @param parms parms to send, encoded with URLEncoder
    * @return CGI host's response, raw, everything including headers
    * and embedded length fields
    * @exception IOException
    */
    public static String getRaw( String websiteURL, String relativeURL,
    String parms ) throws IOException
    {

    URL url = new URL( "http://" + websiteURL );
    int port = url.getPort();
    if ( port == -1 ) port = 80;
    Socket sock = new Socket( websiteURL, port );

    // Obtain data streams
    OutputStream os = sock.getOutputStream();
    InputStream is = sock.getInputStream();


    StringBuffer sb = new StringBuffer( 1000 );
    sb.append( "GET" );
    sb.append( " " );
    sb.append( relativeURL );

    if ( parms.length() > 0 )
    {
    sb.append( "?");
    sb.append( parms );
    }

    sb.append( " " );
    // HTTP header lines end with CRLF.
    sb.append( "HTTP/1.1\r\n" );
    sb.append( "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) Opera 7.11 [en]\r\n" );

    sb.append( "Host: " );
    sb.append( websiteURL );
    sb.append( "\r\n" );

    sb.append( "Accept: text/html, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1\r\n" );
    sb.append( "Accept-Language: en\r\n" );
    sb.append( "Accept-Charset: windows-1252, utf-8, utf-16, iso-8859-1;q=0.6, *;q=0.1\r\n" );
    sb.append( "Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0\r\n" );

    // Referer: could go here.
    // cookies could go here.

    sb.append( "Connection: Keep-Alive, TE\r\n" );
    sb.append( "TE: deflate, gzip, chunked, identity, trailers\r\n" );
    sb.append( "Content-type: application/x-www-form-urlencoded\r\n" );
    sb.append( "\r\n" );

    String header = sb.toString();

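    os.write( header.getBytes( "8859_1" /* encoding */ ) );
    // flush rather than close: closing the stream would close the socket
    // before the reply can be read.
    os.flush();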

    // Read data FROM server till -1 eof
    // get everything. headers, embedded length counts etc.
    String result = readEverything ( is, 32*1024 );

    is.close();
    return result;
    } // end getRaw

    /**
    * Used to read until EOF on an Inputstream that
    * sometimes returns 0 bytes because data have
    * not arrived yet. Does not close the stream.
    *
    * @param is InputStream to read from.
    * @param estimatedLength
    * Estimated number of bytes that will be read.
    * -1 or 0 mean you have no idea. Best to make
    * some sort of guess a little on the high side.
    * @return String representing the contents of the entire
    * stream.
    */
    public static String readEverything( InputStream is, int
    estimatedLength ) throws IOException
    {
    if ( estimatedLength <= 0 )
    {
    estimatedLength = 10*1024;
    }

    StringBuffer buf = new StringBuffer( estimatedLength );

    int chunkSize = Math.min ( estimatedLength, 4*1024 );
    byte[] ba = new byte[ chunkSize ];

    // -1 means eof, 0 means none available for now.
    int bytesRead;

    while ( ( bytesRead = is.read( ba, 0, chunkSize )) >= 0 )
    {
    if ( bytesRead == 0 )
    {
    try
    {
    // no data for now
    // wait a while before trying again to see if data has
    // arrived.
    // avoid hogging cpu in a tight loop
    Thread.sleep( 100 );
    }
    catch ( InterruptedException e )
    {
    Thread.currentThread().interrupt();
    }
    }
    else
    {
    // got some data
    buf.append( new String( ba, 0, bytesRead , "8859_1" /*
    encoding */) );
    }
    }
    return buf.toString();
    } // end readEverything

    /**
    * Reads exactly len bytes from the input stream
    * into the byte array. This method reads repeatedly from the
    * underlying stream until all the bytes are read.
    * InputStream.read is often documented to block like this, but in
    * actuality it does not always do so, and returns early with just a
    * few bytes.
    * readBlocking blocks until all the bytes are read,
    * the end of the stream is detected,
    * or an exception is thrown. You will always get as many bytes as
    * you asked for unless you get an eof or other exception.
    * Unlike readFully, you find out how many bytes you did get.
    *
    * @param b the buffer into which the data is read.
    * @param off the start offset of the data in the array,
    * not offset into the file!
    * @param len the number of bytes to read.
    * @return number of bytes actually read.
    * @exception IOException if an I/O error occurs.
    *
    */
    public static final int readBlocking ( InputStream in , byte b[ ],
    int off, int len ) throws IOException
    {
    int totalBytesRead = 0;
    while ( totalBytesRead < len )
    {
    int bytesRead = in.read( b , off + totalBytesRead , len -
    totalBytesRead );
    if ( bytesRead < 0 )
    {
    break;
    }
    if ( bytesRead == 0 )
    {
    try
    {
    // no data for now
    // wait a while before trying again to see if data has
    // arrived.
    // avoid hogging cpu in a tight loop
    Thread.sleep( 100 );
    }
    catch ( InterruptedException e )
    {
    Thread.currentThread().interrupt();
    }
    }
    else
    {
    totalBytesRead += bytesRead;
    }
    }
    return totalBytesRead;
    }
    // end readBlocking

    } // end class CGIget

    package com.mindprod.voter;
    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    /**
    * Like A StringBuffer but encodes CGI pairs.
    *
    * @author Roedy Green
    * @version 1.0
    * @since 2003-05-26
    */
    public class CGIRequest
    {
    // ideally would extend StringBuffer, but it is final.

    /**
    * constructor
    *
    * @param size estimated size of result string.
    */
    public CGIRequest ( int size )
    {
    this.sb = new StringBuffer (size);
    }

    private final StringBuffer sb;

    /**
    * append a parm=value pair of CGI parameters,
    * encoding them with URL encoding, xxx=yyy&aaa=bbb etc.
    *
    * @param name parameter name
    *
    * @param value parameter value
    */
    public void appendCGIPair ( String name, String value )
    {
    if ( sb.length() != 0 )
    {
    // separates pairs
    sb.append( '&' );
    }
    try
    {
    sb.append( URLEncoder.encode( name , "ASCII" ) );
    sb.append ( '=' );
    sb.append( URLEncoder.encode( value, "ASCII" ) );
    }
    catch ( UnsupportedEncodingException e )
    {
    throw new IllegalArgumentException("ASCII encoding support missing");
    }

    }

    /**
    * get request as a URL-encoded String.
    *
    * @return result CGI request string.
    */
    public String toString()
    {
    return sb.toString();
    }
    }




    --
    Bush crime family lost/embezzled $3 trillion from Pentagon.
    Complicit Bush-friendly media keeps mum. Rumsfeld confesses on video.
    http://www.infowars.com/articles/us/mckinney_grills_rumsfeld.htm

    Canadian Mind Products, Roedy Green.
    See http://mindprod.com/iraq.html photos of Bush's war crimes
     
    Roedy Green, Jun 30, 2005
    #10
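The CGIRequest idea in the post above (URL-encode each name and value, join the pairs with '&') can be sketched in a few lines with only java.net.URLEncoder. This is an illustration under assumptions, not Roedy's class: the name QueryStringSketch, the use of a LinkedHashMap and the choice of UTF-8 are mine, and the sample pairs are borrowed from his Oanda request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringSketch {

    /** Joins name=value pairs with '&', URL-encoding both halves of each pair. */
    static String toQuery(Map<String, String> pairs) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            if (sb.length() != 0) {
                sb.append('&');   // separates pairs
            }
            try {
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"));
                sb.append('=');
                sb.append(URLEncoder.encode(e.getValue(), "UTF-8"));
            } catch (UnsupportedEncodingException ex) {
                // every JVM is required to support UTF-8
                throw new IllegalStateException(ex);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> pairs = new LinkedHashMap<>();   // keeps insertion order
        pairs.put("format", "CSV");
        pairs.put("dest", "Get Table");
        pairs.put("sel_list", "CAD_USD_EUR");
        System.out.println(toQuery(pairs));
        // prints format=CSV&dest=Get+Table&sel_list=CAD_USD_EUR
    }
}
```

The insertion-ordered map matters here because the original code notes that parameter order appeared to matter to the server.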
  11. Andrew Thompson, Jun 30, 2005
    #11
  12. Paul Battersby wrote:
    >
    > If you can see something simple that I'm doing wrong, that would be great.


    I'm actually starting to think you were right.
    Very strange, with a simple telnet connection it works perfectly. And
    the URLConnection doesn't have any header. Very strange.
    What the hell, a telnet works. It can't be a missing header
     
    Andrea Desole, Jun 30, 2005
    #12
  13. Okay, I noticed that Google gives a very nice error message, saying that
    the client is not allowed. So, I don't know what is wrong with the Java
    class (there must be a way to find out it's coming from Java code; you
    should look at the JDK source), but you can definitely get a result
    writing a telnet connection in Java, which is not difficult.
    Before you do this, however, I should point out that automated queries
    are against the terms of service (link courtesy of the error message):

    http://www.google.com/terms_of_service.html
     
    Andrea Desole, Jun 30, 2005
    #13
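A rough sketch of the "telnet connection in Java" idea, assuming a plain socket on port 80 and a request with nothing but a Host header. The class name RawGet is hypothetical. The reply comes back raw, status line and headers included, and a chunked body is not decoded, so treat this as a diagnostic aid rather than a finished downloader.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class RawGet {

    /** Builds the same kind of request you would type into telnet. */
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"   // ask the server to hang up when done
             + "\r\n";                   // blank line ends the headers
    }

    /** Sends the request on port 80 and returns everything that comes back. */
    static String fetch(String host, String path) throws IOException {
        try (Socket sock = new Socket(host, 80)) {
            OutputStream os = sock.getOutputStream();
            os.write(buildRequest(host, path).getBytes(StandardCharsets.ISO_8859_1));
            os.flush();   // flush only; closing the stream would close the socket

            InputStream is = sock.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = is.read(chunk)) != -1) {   // -1 means the server hung up
                reply.write(chunk, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch("www.google.com", "/search?q=business"));
    }
}
```

Because no User-Agent line is sent at all, this reproduces the telnet experiment more closely than URLConnection does.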
  14. "Andrea Desole" <> wrote in message
    news:a445c$42c410fd$d468cb3c$-service.com...
    >
    >
    > Paul Battersby wrote:
    >>
    >> If you can see something simple that I'm doing wrong, that would be
    >> great.

    >
    > I'm actually starting to think you were right.
    > Very strange, with a simple telnet connection it works perfectly. And the
    > URLConnection doesn't have any header. Very strange.
    > What the hell, a telnet works. It can't be a missing header


    Maybe the URLConn adds some headers by default, such as user-agent: Java or
    something which google is rejecting.
     
    sks, Jun 30, 2005
    #14
  15. In article <1goa3xe9q7v66$>,
    Andrew Thompson <> wrote:
    >On Thu, 30 Jun 2005 15:21:33 GMT, Roedy Green wrote:
    >
    >> Use http://mindprod.com/jgloss/ethereal.com ..

    >
    >'404'
    >
    >It is lucky your site is so organised Roedy..


    That's not luck -- that's hard work!
    --
    "Yo' ideas need to be thinked befo' they are say'd" - Ian Lamb, age 3.5
    http://www.cs.queensu.ca/~dalamb/ qucis->cs to reply (it's a long story...)
     
    David Alex Lamb, Jun 30, 2005
    #15
  16. sks wrote:
    >
    > Maybe the URLConn adds some headers by default, such as user-agent: Java or
    > something which google is rejecting.


    That's what I thought, but I dumped all the request parameters, and the
    list turned out to be empty. If the class does it, it hides it.
    I would really like to know how they do it.
     
    Andrea Desole, Jun 30, 2005
    #16
  17. Andrea Desole wrote:
    >
    > that's what I thought, but I dumped all the request parameters, and the
    > list turned out to be empty. If the class does it, it hides it.
    > I would really like to know how they do it.


    Sorry, I meant request headers, as returned by
    URLConnection.getRequestProperties().
     
    Andrea Desole, Jun 30, 2005
    #17
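One possible explanation for the empty list, sketched under assumptions: getRequestProperties() seems to report only the headers the caller has set, while the JDK adds its defaults (a User-Agent among them) when the request is actually sent. The sketch below checks this against a throwaway local server built on the JDK's com.sun.net.httpserver. HiddenHeaderDemo and the local server are hypothetical stand-ins and say nothing about how Google decides what to reject.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.concurrent.atomic.AtomicReference;

class HiddenHeaderDemo {

    /**
     * Returns two lines: what getRequestProperties() reports before sending,
     * and the User-Agent that the local server actually received.
     */
    static String demo() throws IOException {
        AtomicReference<String> seenAgent = new AtomicReference<>();
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            seenAgent.set(exchange.getRequestHeaders().getFirst("User-Agent"));
            exchange.sendResponseHeaders(200, -1);   // 200 with no body
            exchange.close();
        });
        server.start();
        try {
            String addr = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(addr).toURL().openConnection();
            // Must be read before connecting; afterwards the call is not allowed.
            String before = "set by caller: " + conn.getRequestProperties();
            conn.getResponseCode();                  // this sends the request
            return before + "\nseen by server: User-Agent=" + seenAgent.get();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

On a stock JDK I would expect the first line to show an empty map and the second to show an agent containing "Java/" followed by the runtime version, which would fit the guess that the class adds the header without exposing it.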
  18. On Thu, 30 Jun 2005 09:59:11 -0500, Tor Iver Wilhelmsen wrote
    (in article <>):

    > "Paul Battersby" <> writes:
    >
    >> I figure I need to pass some sort of header information or something so that
    >> I appear to be a browser.

    >
    > So you are looking for URLConnection.setRequestProperty("User-Agent",
    > "some browser string").


    I have a product called Parsnips that uses Java to download URLs and index
    them. It works fine with the URL you gave.

    Here is the code snippet I use:

    System.setProperty("sun.net.client.defaultConnectTimeout", "10000");
    System.setProperty("sun.net.client.defaultReadTimeout", "10000");
    System.setProperty("http.agent", "Parsnips/" + Parsnips.CURRENT_VERSION
        + " (" + System.getProperty("os.name") + ")");

    URL url = new URL(urlStr);
    BufferedReader urlReader = new BufferedReader(
        new InputStreamReader(url.openStream()));
    callback = new ParserCallback(url, null);
    new ParserDelegator().parse(urlReader, callback, true);
    urlReader.close();

    As I say, this is working for me. You will probably want to replace the
    ParserCallback stuff with some other way to read the URL stream.

    --
    Bill Tschumy
    Otherwise -- Austin, TX
    http://www.otherwise.com
     
    Bill Tschumy, Jun 30, 2005
    #18
  19. On 30 Jun 2005 16:33:34 GMT, David Alex Lamb wrote:

    > Andrew Thompson <> wrote:

    ....
    >>It is lucky your site is so organised Roedy..

    >
    > That's not luck --


    I did mean lucky for the users..

    >...that's hard work!


    True. As if putting the information there in the
    first place were not hard enough, I find checking
    and updating pages/sites takes a lot *more* work
    in the long run.

    --
    Andrew Thompson
    http://www.PhySci.org/codes/ Web & IT Help
    http://www.PhySci.org/ Open-source software suite
    http://www.1point1C.org/ Science & Technology
    http://www.LensEscapes.com/ Images that escape the mundane
     
    Andrew Thompson, Jun 30, 2005
    #19
  20. And it seems we have a winner!

    Adding this line:

    System.setProperty("http.agent", "Test/1.0" + "(" +
    System.getProperty("os.name") + ")");

    to the sample code I posted previously, solves the problem.

    THANKS!!!!!!!!
     
    Paul Battersby, Jun 30, 2005
    #20
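Putting the fix together with the earlier test program gives roughly the following. This is a sketch under assumptions: the class name AgentPropertyFetch and the helper agentString are invented here, and "Test/1.0" is just the placeholder used in the thread. As far as I can tell the JDK reads http.agent once, when its HTTP handler is first loaded, and appends its own "Java/<version>" to the value, so the property should be set before the first request.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;

class AgentPropertyFetch {

    /** Builds a "Name/version (OS)" string; the product name is made up. */
    static String agentString(String product, String version) {
        return product + "/" + version + " (" + System.getProperty("os.name") + ")";
    }

    /** Downloads the page with the JVM-wide agent that main() configured. */
    static String loadPage(String webAddr) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(URI.create(webAddr).toURL().openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        // Set this before the first HTTP request is made in this JVM.
        System.setProperty("http.agent", agentString("Test", "1.0"));
        System.out.println(loadPage("http://www.google.com/search?q=business"));
    }
}
```

Unlike setRequestProperty, the system property applies to every URLConnection in the JVM, which is convenient for a small tool but worth keeping in mind inside a larger application.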