Scraping Wikipedia with Python

Discussion in 'Python' started by Dotan Cohen, Aug 11, 2009.

  1. Dotan Cohen

    Dotan Cohen Guest

    I plan on making a geography-learning Anki [1] deck, and Wikipedia has
    the information that I need in nicely formatted tables on the side of
    each country's page. Has someone already invented a wheel to parse and
    store (scrape) that data? It is probably not difficult to code, and it
    is within the Wikipedia license, but if that wheel has already been
    invented then I don't want to redo it. I tried googling for a
    Wikipedia-specific solution but found none. Is there a general-purpose
    solution that I could use?

    Note that I am a regular Wikipedia contributor and plan on staying
    within the realm of Wikipedia's rules.


    [1] http://ichi2.net/anki/

    --
    Dotan Cohen

    http://what-is-what.com
    http://gibberish.co.il
     
    Dotan Cohen, Aug 11, 2009
    #1

  2. John Nagle

    John Nagle Guest

    Dotan Cohen wrote:
    > I plan on making a geography-learning Anki [1] deck, and Wikipedia has
    > the information that I need in nicely formatted tables on the side of
    > each country's page. Has someone already invented a wheel to parse and
    > store that data (scrape)?


    Wikipedia has an API for computer access. See

    http://www.mediawiki.org/wiki/API
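    For instance, fetching one page's raw wikitext needs nothing beyond
    the standard library. A minimal sketch (the query parameters follow my
    reading of the API docs, so double-check them there):

    import urllib
    import json  # in the standard library since Python 2.6

    # Ask api.php for the current wikitext of one page.
    params = urllib.urlencode({
        'action': 'query',
        'titles': 'France',   # hypothetical example page
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    })
    data = json.load(urllib.urlopen('http://en.wikipedia.org/w/api.php?' + params))

    # The result is keyed by page id; take the first (and only) page.
    page = data['query']['pages'].values()[0]
    wikitext = page['revisions'][0]['*']
    print wikitext[:200]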

    John Nagle
     
    John Nagle, Aug 11, 2009
    #2

  3. Dotan Cohen

    Dotan Cohen Guest

    > Try reading a little there! Starting there I went to
    >
    > http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot
    >
    > where I found a section on existing bots, comments on how the "scraping"
    > is not what you want, and even a Python section with a link to something
    > labelled  PyWikipediaBot...
    >


    Thanks. I read the first bit of that page, but did not finish it.
    Grepping it for Python led me to what I need.

    Sorry for the noise.


    --
    Dotan Cohen

    http://what-is-what.com
    http://gibberish.co.il
     
    Dotan Cohen, Aug 11, 2009
    #3
  4. Paul Rubin

    Paul Rubin Guest

    Dotan Cohen <> writes:
    > Thanks. I read the first bit of that page, but did not finish it.
    > Grepping it for Python led to to what I need.


    maybe you want dbpedia.
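    DBpedia extracts the Wikipedia infobox data into structured triples
    and offers a public SPARQL endpoint, which fits the country-table use
    case well. A sketch, with the property name being my assumption (I
    have not verified it against the DBpedia ontology):

    import urllib
    import json

    # Ask DBpedia for the capital of France.
    query = """
    SELECT ?capital WHERE {
      <http://dbpedia.org/resource/France>
          <http://dbpedia.org/ontology/capital> ?capital .
    }
    """
    params = urllib.urlencode({
        'query': query,
        'format': 'application/sparql-results+json',
    })
    data = json.load(urllib.urlopen('http://dbpedia.org/sparql?' + params))
    for row in data['results']['bindings']:
        print row['capital']['value']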
     
    Paul Rubin, Aug 11, 2009
    #4
  5. Thorsten Kampe, Aug 12, 2009
    #5
  6. Dotan Cohen

    Dotan Cohen Guest

    > maybe you want dbpedia.

    I did not know about this. Thanks!

    That is the reason why I ask. This list has an unbelievable amount of
    collective knowledge, and I am certain that asking "how much is 2+2"
    would net an insightful answer that would teach me something.

    Thank you, Paul, and thank you to the entire Python list!

    --
    Dotan Cohen

    http://what-is-what.com
    http://gibberish.co.il
     
    Dotan Cohen, Aug 12, 2009
    #6
  7. Dotan Cohen

    Dotan Cohen Guest

    Dotan Cohen, Aug 12, 2009
    #7
  8. Paul Rubin

    Paul Rubin Guest

    Dotan Cohen <> writes:
    > > maybe you want dbpedia.

    > I did not know about this. Thanks!


    You might also like freebase/metaweb.
     
    Paul Rubin, Aug 13, 2009
    #8
  9. Andre Engels

    Andre Engels Guest

    On Tue, Aug 11, 2009 at 8:53 PM, David C Ullrich <> wrote:

    > Try reading a little there! Starting there I went to
    >
    > http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot
    >
    > where I found a section on existing bots, comments on how the "scraping"
    > is not what you want, and even a Python section with a link to something
    > labelled  PyWikipediaBot...


    Some information on using the PyWikipediaBot for scraping from someone
    who used to program on the bot (and occasionally still does):

    To make the framework work, you need to add a file user-config.py with
    the following contents:

    family = 'wikipedia'
    mylang = 'en'

    If you want to use the bot to also edit pages on Wikipedia, you will
    have to add:

    usernames['wikipedia']['en'] = '<the username of your bot>'

    If you work on another language, you of course use that language's
    abbreviation instead of 'en'.
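
    For example, a complete user-config.py for a bot that edits the Dutch
    Wikipedia would look like this (the bot name is a placeholder):

    # user-config.py
    family = 'wikipedia'
    mylang = 'nl'
    usernames['wikipedia']['nl'] = 'MyExampleBot'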

    The heart of the framework is the file wikipedia.py; you need to
    import that one. It contains two important classes, Page and Site,
    which represent a Wikipedia page and the site as a whole,
    respectively.

    It is best to put your code in a try like this:

    try:
        mysite = wikipedia.getSite()
        <your code here>
    finally:
        wikipedia.stopme()

    The stopme() functionality is part of the bot's throttling behaviour,
    which avoids over-feeding the server with requests. The bot waits a
    certain time (10 seconds by default) between two requests, but if you
    have several bots running, it will lengthen this time. stopme() tells
    the framework that the bot is not running any more, so other runs are
    not delayed by it. wikipedia.getSite() gets the Site object for your
    default site (with the settings above, that is the English-language
    Wikipedia).

    Still with me? Good, because now we get into the real programming.

    The Page class has as its __init__:
    def __init__(self, site, title, insite=None, defaultNamespace=0):

    Here site is the wiki on which the page exists (usually this will be
    mysite, which is why I defined it above) and title is the title of the
    page. The optional parameters are for special usage.

    The Page class has a number of methods, which you can find in the
    file, but some of the most important are:
    page.title() - the title of the page
    page.site() - the wiki the page is on
    page.get() - the (wiki) text of the page
    page.put(text) - saves the page with 'text' as its new content. An
    important optional parameter is 'comment', which specifies the summary
    that is given with the change
    page.exists() - a boolean, true if the page exists, false otherwise
    page.linkedPages() - a list of Page objects, being the pages the page links to
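
    A small sketch tying the constructor and these methods together (the
    page title is just an example):

    import wikipedia

    try:
        mysite = wikipedia.getSite()
        page = wikipedia.Page(mysite, 'Python (programming language)')
        if page.exists():
            print page.title()
            text = page.get()              # the raw wikitext
            print len(text), 'characters'
            for linked in page.linkedPages():
                print 'links to:', linked.title()
    finally:
        wikipedia.stopme()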

    However, instead of page.get() it is advisable to use:

    wikipedia.getall(site, pages)

    with 'site' being a Site object (e.g. mysite) and pages a list (or
    more generally, iterable) of Page objects. It will get all pages in
    the list using a single call to the wiki, thus speeding up your bot
    and at the same time reducing its load on the wiki. Once a page has
    been loaded (either through get or through getall), subsequent calls
    to page.get() will not reload it. Thus, the normal way of working is
    to create a list of pages one is interested in, use getall (in groups
    of 60 or so) to load them, then use get to work with them.
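
    In code, that pattern looks something like this (the titles are just
    examples; the batch size of 60 follows the suggestion above):

    import wikipedia

    titles = ['France', 'Germany', 'Spain']   # hypothetical examples
    try:
        mysite = wikipedia.getSite()
        pages = [wikipedia.Page(mysite, t) for t in titles]
        # Load the pages in batches of 60, one server call per batch.
        for i in range(0, len(pages), 60):
            wikipedia.getall(mysite, pages[i:i+60])
        for page in pages:
            text = page.get()   # already loaded, so no new server call
            # ... work with text ...
    finally:
        wikipedia.stopme()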

    Another useful file in the framework is pagegenerators. It provides a
    number of generators that yield Page objects. Some interesting ones
    (check the code for the exact parameters):

    AllpagesPageGenerator: generates all pages of the wiki, alphabetically
    from a specified begin
    ReferringPageGenerator: all pages linking to a given page
    CategorizedPageGenerator: all pages in a given category
    LinkedPageGenerator: all pages linked to from a given page
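
    For the original question (the country pages), CategorizedPageGenerator
    is probably the interesting one. A sketch, assuming the category object
    comes from the framework's catlib module and with the category name
    being only an example:

    import wikipedia
    import catlib
    import pagegenerators

    try:
        mysite = wikipedia.getSite()
        cat = catlib.Category(mysite, 'Category:Countries in Europe')
        for page in pagegenerators.CategorizedPageGenerator(cat):
            print page.title()
    finally:
        wikipedia.stopme()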

    Other generators are used by 'wrapping them around' a given generator.
    The most important of these is the PreloadingGenerator, which ensures
    that the pages are preloaded (using wikipedia.getall) in groups.

    A simple way to use the bot framework to scrape all pages of the
    English Wikipedia (warning: this takes a few days!) would be:

    import wikipedia
    import pagegenerators

    basicgen = pagegenerators.AllpagesPageGenerator(includeredirects=False)
    generator = pagegenerators.PreloadingGenerator(basicgen, 200)
    for page in generator:
        title = page.title()
        text = page.get()
        <do whatever you want with title and text>

    --
    André Engels,
     
    Andre Engels, Aug 13, 2009
    #9
