Re: funny (strange behavior) link

Discussion in 'HTML' started by Harlan Messinger, Jul 15, 2008.

  1. Eric wrote:
    > I'm using a script to download Intel architecture specs every couple of months
    > so I'll always have the current docs.
    > Go here:
    > http://www.intel.com/products/processor/manuals/
    > scroll down to the bottom of the page, the link for
    > "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
    > points to "http://www.intel.com/design/processor/manuals/248966.pdf"
    > which I can download by clicking on the link, but in my script using wget,
    > I always get a 403 error - why?
    >
    > Example:
    > wget http://www.intel.com/design/processor/manuals/248966.pdf -O
    > Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf
    > --10:10:04-- http://www.intel.com/design/processor/manuals/248966.pdf
    > =>
    > `Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf'
    > Resolving www.intel.com... 64.209.118.114, 64.209.118.105
    > Connecting to www.intel.com|64.209.118.114|:80... connected.
    > HTTP request sent, awaiting response... 403 Forbidden
    > 10:10:04 ERROR 403: Forbidden.


    I'm guessing that they filter out requests from automated or automatable
    tools like wget, googlebot, linkchecker, and so on to conserve on server
    load and bandwidth.
     
    Harlan Messinger, Jul 15, 2008
    #1

  2. On 2008-07-15, Eric wrote:
    > Harlan Messinger wrote:
    >
    >> Eric wrote:
    >>> I'm using a script to download Intel architecture specs every couple of
    >>> months so I'll always have the current docs.
    >>> Go here:
    >>> http://www.intel.com/products/processor/manuals/
    >>> scroll down to the bottom of the page, the link for
    >>> "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
    >>> points to "http://www.intel.com/design/processor/manuals/248966.pdf"
    >>> which I can download by clicking on the link, but in my script using wget,
    >>> I always get a 403 error - why?
    >>>
    >>> Example:
    >>> wget http://www.intel.com/design/processor/manuals/248966.pdf -O
    >>> Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf
    >>> --10:10:04-- http://www.intel.com/design/processor/manuals/248966.pdf
    >>> =>
    >>> `Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf'
    >>> Resolving www.intel.com... 64.209.118.114, 64.209.118.105
    >>> Connecting to www.intel.com|64.209.118.114|:80... connected.
    >>> HTTP request sent, awaiting response... 403 Forbidden
    >>> 10:10:04 ERROR 403: Forbidden.

    >>
    >> I'm guessing that they filter out requests from automated or automatable
    >> tools like wget, googlebot, linkchecker, and so on to conserve on server
    >> load and bandwidth.

    >
    > I wonder how they distinguish wget from a real browser, or if I can get around
    > it.


    You can change the user agent string:

    -U agent-string
    --user-agent=agent-string
    Identify as agent-string to the HTTP server.

    The HTTP protocol allows the clients to identify themselves using a
    "User-Agent" header field. This enables distinguishing the WWW
    software, usually for statistical purposes or for tracing of
    protocol violations. Wget normally identifies as Wget/version, version
    being the current version number of Wget.

    However, some sites have been known to impose the policy of
    tailoring the output according to the "User-Agent"-supplied information.
    While this is not such a bad idea in theory, it has been abused by
    servers denying information to clients other than (historically)
    Netscape or, more frequently, Microsoft Internet Explorer. This
    option allows you to change the "User-Agent" line issued by Wget.
    Use of this option is discouraged, unless you really know what you
    are doing.

    Specifying empty user agent with --user-agent="" instructs Wget not
    to send the "User-Agent" header in HTTP requests.
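
    Applied to the download from the original post, the option could be
    used roughly as in the sketch below. The browser-like agent string,
    the fetch_manual function name and the DRY_RUN switch are made up for
    illustration; nothing in this thread confirms which agent strings
    Intel's server accepts.

```shell
#!/bin/sh
# Sketch only: fetch the manual while sending a browser-like User-Agent.
# The agent string is an illustrative assumption, not a known-good value.
url="http://www.intel.com/design/processor/manuals/248966.pdf"
out="Intel_64_and_IA-32_Architectures_Optimization_Reference_Manual_248966.pdf"
ua="Mozilla/5.0 (X11; Linux x86_64)"

fetch_manual() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Hypothetical dry-run switch: print the command instead of running it.
        printf 'wget --user-agent="%s" -O "%s" "%s"\n' "$ua" "$out" "$url"
    else
        # --user-agent replaces the default "Wget/version" header value.
        wget --user-agent="$ua" -O "$out" "$url"
    fi
}
```

    Setting DRY_RUN=1 before calling fetch_manual prints the command so it
    can be checked first; calling fetch_manual without it runs wget.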



    --
    Chris F.A. Johnson, webmaster <http://Woodbine-Gerrard.com>
    ===================================================================
    Author:
    Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
     
    Chris F.A. Johnson, Jul 16, 2008
    #2

  3. Harlan Messinger

    On Jul 15, 11:32 pm, Eric wrote:
    > Harlan Messinger wrote:
    > > I'm guessing that they filter out requests from automated or automatable
    > > tools like wget, googlebot, linkchecker, and so on to conserve on server
    > > load and bandwidth.

    >
    > I wonder how they distinguish wget from a real browser, or if I can get around
    > it.


    Yes, they can tell from the User-Agent header that a request is not
    coming from a browser, and block wget or any other kind of HTTP
    request that looks like it comes from a bot.

    But did you try contacting Intel to ask whether they offer an API or
    RSS feeds that can be fetched with wget or other automated HTTP
    requests?
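
    How Intel actually filters requests is not known from this thread. As
    a rough sketch of the kind of User-Agent check a server could apply,
    here is a hypothetical helper; the function name and the patterns are
    assumptions chosen for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: guess whether a User-Agent string looks automated.
# The patterns are illustrative assumptions, not any real site's rules.
looks_automated() {
    case "$1" in
        "")                            return 0 ;;  # no User-Agent sent at all
        Wget/*|curl/*)                 return 0 ;;  # command-line download tools
        *[Bb]ot*|*[Ll]ink[Cc]hecker*)  return 0 ;;  # crawlers and link checkers
        *)                             return 1 ;;  # anything else passes as a browser
    esac
}
```

    A check like this only inspects a string the client chooses to send,
    which is why changing the agent string can get around it.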
     
    Jul 17, 2008
    #3