Get Google Scholar results using Python

Discussion in 'Python' started by রুদ্র ব্যাণার্জী, Oct 1, 2012.

  1. When I try to access a Google Scholar search result using Python, I
    get the following error (403):
    $ python
    Python 2.7.3 (default, Jul 24 2012, 10:05:38)
    [GCC 4.7.0 20120507 (Red Hat 4.7.0-5)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from HTMLParser import HTMLParser
    >>> import urllib2

    >>> response = urllib2.urlopen('http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp=')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib64/python2.7/urllib2.py", line 406, in open
        response = meth(req, response)
      File "/usr/lib64/python2.7/urllib2.py", line 519, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib64/python2.7/urllib2.py", line 444, in error
        return self._call_chain(*args)
      File "/usr/lib64/python2.7/urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "/usr/lib64/python2.7/urllib2.py", line 527, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 403: Forbidden
    >>>


    Could you kindly explain how to get rid of this?
     
    রুদ্র ব্যাণার্জী, Oct 1, 2012
    #1

  2. Nick Cash Guest

    > urllib2.urlopen('http://scholar.google.co.uk/scholar?q=albert
    > ...
    > urllib2.HTTPError: HTTP Error 403: Forbidden
    > >>>
    >
    > Could you kindly explain how to get rid of this?


    Looks like Google blocks non-browser user agents from retrieving this query. You *could* work around it by setting the User-Agent header to something fake that looks browser-ish, but you're almost certainly breaking Google's TOS if you do so.
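To see what "non-browser user agent" means in practice, the sketch below prints the User-Agent that Python's standard library sends when none is set. It assumes Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`; the thread itself uses Python 2.7. No network request is made.

```python
import urllib.request

# build_opener() returns an OpenerDirector; its addheaders list holds the
# default headers that are added to every request made through it.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The default header key is spelled 'User-agent'.
default_user_agent = default_headers["User-agent"]

# Prints a string of the form Python-urllib/<version>
print(default_user_agent)
```

A server that filters on this header can recognise such a string as a script rather than a browser, which is consistent with the 403 shown in the first post.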

    Should you really, really want to, urllib2 makes it easy:

        req = urllib2.Request(
            "http://scholar.google.co.uk/scholar?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp=",
            headers={"User-Agent": "Mozilla/5.0 Cheater/1.0"})
        response = urllib2.urlopen(req)
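For readers on Python 3, a rough equivalent of the call above might look like the following sketch. The URL and User-Agent string are taken from this post; the helper names `build_request` and `fetch_status` are hypothetical, introduced only for illustration. The code builds the request but does not send it, and the caveat above about Google's TOS applies unchanged.

```python
import urllib.error
import urllib.request

# URL and User-Agent string as given in the post above.
URL = ("http://scholar.google.co.uk/scholar"
       "?q=albert+einstein%2B1905&btnG=&hl=en&as_sdt=0%2C5&as_sdtp=")
USER_AGENT = "Mozilla/5.0 Cheater/1.0"


def build_request(url, user_agent):
    """Hypothetical helper: return a Request carrying an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(request, opener=urllib.request.urlopen):
    """Hypothetical helper: open the request and return the HTTP status code.

    An HTTPError (such as the 403 in the first post) is caught and its
    code is returned instead of letting the traceback propagate.
    """
    try:
        with opener(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Build the request only; nothing is sent until fetch_status(request) is called.
request = build_request(URL, USER_AGENT)

# Request stores header names capitalized, so the lookup key is 'User-agent'.
print(request.get_header("User-agent"))
```

Passing the opener as a parameter is a design choice made here so that the error handling can be exercised with a stand-in function, without contacting the server.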

    -Nick Cash
     
    Nick Cash, Oct 1, 2012
    #2

  3. On 2012-10-01, Nick Cash wrote:
    >> urllib2.urlopen('http://scholar.google.co.uk/scholar?q=albert
    >> ...
    >> urllib2.HTTPError: HTTP Error 403: Forbidden
    >>
    >> Could you kindly explain how to get rid of this?
    >
    > Looks like Google blocks non-browser user agents from retrieving this
    > query. You *could* work around it by setting the User-Agent header to
    > something fake that looks browser-ish, but you're almost certainly
    > breaking Google's TOS if you do so.


    I don't know about that particular Google service, but Google often
    provides an API that's intended for use by non-browser programs.
    Those interfaces are usually both easier to use for the programmer and
    impose less load on the servers.

    --
    Grant Edwards               grant.b.edwards        Yow! I am deeply CONCERNED
                                      at               and I want something GOOD
                                  gmail.com            for BREAKFAST!
     
    Grant Edwards, Oct 1, 2012
    #3
  4. I know of one more Python app that does the same thing:
    http://www.icir.org/christian/downloads/scholar.py

    and a few other apps (e.g. Mendeley Desktop), for which I found an
    explanation (from
    http://academia.stackexchange.com/questions/2567/api-eula-and-scraping-for-google-scholar )
    that:
    "I know how Mendley uses it: they require you to click a button for each
    individual search of Google Scholar. If they automatically did the
    Google Scholar meta-data search for each paper when you import a
    folder-full then they would violate the old Scholar EULA. That is why
    they make you click for each query: if each query is accompanied by a
    click and not part of some script or loop then it is in compliance with
    the old EULA."

    So, if I manage to use the User-Agent as you showed, will I still be
    violating the Google EULA?

    This is my first attempt at scraping HTML, so please help.

     
    রুদ্র ব্যাণার্জী, Oct 1, 2012
    #4
  5. Jerry Hill Guest

    On Mon, Oct 1, 2012 at 1:28 PM, রুদ্র ব্যাণার্জী wrote:
    > So, if I manage to use the User-Agent as you showed, will I still be
    > violating the Google EULA?


    Very likely, yes. The overall Google Terms of Service
    (http://www.google.com/intl/en/policies/terms/) say "Don't misuse our
    Services. For example, don't interfere with our Services or try to
    access them using a method other than the interface and the
    instructions that we provide."

    The only method that Google appears to allow for accessing Scholar is
    via the web interface, and they explicitly block web scraping through
    that interface, as you discovered. It's true that you can get around
    their block, but I believe that doing so violates the terms of
    service.

    Google does not appear to offer an API to access Scholar
    programmatically, nor do I see a more specific EULA or TOS for the
    Scholar service beyond that general TOS document.

    That said, I am not a lawyer. If you want legal advice, you'll need
    to pay a lawyer for that advice.

    --
    Jerry
     
    Jerry Hill, Oct 1, 2012
    #5