Scraping Wikipedia with Python

Dotan Cohen

I plan on making a geography-learning Anki [1] deck, and Wikipedia has
the information that I need in nicely formatted tables on the side of
each country's page. Has someone already invented a wheel to parse and
store that data (scrape)? It is probably not difficult to code, and
within the Wikipedia license, but if that wheel has already been
invented then I don't want to redo it. I tried googling for a
Wikipedia-specific solution but found none. Is there a general-purpose
solution that I could use?

Note that I am a regular Wikipedia contributor and plan on staying
within the realm of Wikipedia's rules.


[1] http://ichi2.net/anki/
 
John Nagle

Dotan said:
I plan on making a geography-learning Anki [1] deck, and Wikipedia has
the information that I need in nicely formatted tables on the side of
each country's page. Has someone already invented a wheel to parse and
store that data (scrape)?

Wikipedia has an API for computer access. See

http://www.mediawiki.org/wiki/API
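For illustration, a minimal Python 3 sketch that fetches one article's
raw wikitext through that API with only the standard library (the page
title "Germany" and the User-Agent string are arbitrary example values):

import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    """Fetch the raw wikitext of one article via action=parse."""
    params = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
    })
    request = urllib.request.Request(
        API_URL + "?" + params,
        headers={"User-Agent": "geography-deck-example/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        data = json.loads(response.read().decode("utf-8"))
    return data["parse"]["wikitext"]["*"]

# The country infobox sits at the top of the returned wikitext.
print(fetch_wikitext("Germany")[:300])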

John Nagle
 
Dotan Cohen

Try reading a little there! Starting there I went to
http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot

where I found a section on existing bots, comments on how the "scraping"
is not what you want, and even a Python section with a link to something
labelled PyWikipediaBot...

Thanks. I read the first bit of that page, but did not finish it.
Grepping it for Python led me to what I need.

Sorry for the noise.
 
Paul Rubin

Dotan Cohen said:
Thanks. I read the first bit of that page, but did not finish it.
Grepping it for Python led to to what I need.

Maybe you want DBpedia.
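As a minimal sketch, here is how one could ask DBpedia's public SPARQL
endpoint for a country's capital instead of parsing the infobox by hand
(the choice of Germany is an arbitrary example):

import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"

# DBpedia extracts this value from the Wikipedia infobox.
query = """
SELECT ?capital WHERE {
  <http://dbpedia.org/resource/Germany>
      <http://dbpedia.org/ontology/capital> ?capital .
}
"""

request = urllib.request.Request(
    ENDPOINT + "?" + urllib.parse.urlencode({"query": query}),
    headers={"Accept": "application/sparql-results+json"},
)
with urllib.request.urlopen(request) as response:
    results = json.loads(response.read().decode("utf-8"))

for row in results["results"]["bindings"]:
    print(row["capital"]["value"])  # e.g. http://dbpedia.org/resource/Berlin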
 
Dotan Cohen

Paul Rubin said:
Maybe you want DBpedia.

I did not know about this. Thanks!

That is the reason why I ask. This list has an unbelievable collective
knowledge and I am certain that asking "how much is 2+2" would net an
insightful answer that would teach me something.

Thank you, Paul, and thank you to the entire Python list!
 
Andre Engels

Try reading a little there! Starting there I went to

http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot

where I found a section on existing bots, comments on how the "scraping"
is not what you want, and even a Python section with a link to something
labelled PyWikipediaBot...

Some information on using the PyWikipediaBot for scraping from someone
who used to program on the bot (and occasionally still does):

To make the framework work, you need to add a file user-config.py with
the following contents:

family = 'wikipedia'
mylang = 'en'

If you want to use the bot to also edit pages on wikipedia, you will
have to add:

usernames['wikipedia']['en'] = <the username of your bot>

If you work on another language, you of course use that language's
abbreviation instead of 'en'.
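Put together, a minimal user-config.py along those lines could look
like this (the bot name 'MyGeographyBot' is a made-up placeholder; the
last line is only needed if the bot will edit):

# user-config.py -- minimal configuration for the framework
family = 'wikipedia'   # work on Wikipedia (rather than a sister project)
mylang = 'en'          # default to the English-language edition

# Only needed for editing; the username here is a placeholder.
usernames['wikipedia']['en'] = 'MyGeographyBot'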

The heart of the framework is the file wikipedia.py; you need to
import that one. It contains two important classes, Page and Site,
which represent a Wikipedia page and the site as a whole,
respectively.

It is best to put your code in a try like this:

try:
    mysite = wikipedia.getSite()
    <your code here>
finally:
    wikipedia.stopme()

The stopme() functionality is part of the bot's throttling behaviour,
which avoids over-feeding the server with requests. The framework waits
a certain time (default is 10 seconds) between two requests, but if you
have several bots running, it will lengthen this time. stopme() signals
that this bot is not running any more, so other runs are not delayed by
it. wikipedia.getSite() gets the Site object for your default site
(with the settings above, that is the English-language Wikipedia).

Still with me? Good, because now we get into the real programming.

The Page class has as its __init__:
def __init__(self, site, title, insite=None, defaultNamespace=0):

Here, site is the wiki on which the page exists (usually this will be
mysite, which is why I defined it above) and title is the title of the
page. The optional parameters are for special usage.

The Page class has a number of methods, which you can find in the
file. Some of the most important are listed below (a short usage sketch
follows the list):
page.title() - the title of the page
page.site() - the wiki the page is on
page.get() - the (wiki) text of the page
page.put(text) - saves the page with 'text' as its new content. An
important optional parameter is 'comment', which specifies the summary
that is given with the change
page.exists() - a boolean, true if the page exists, false otherwise
page.linkedPages() - a list of Page objects, being the pages the page links to
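As a short sketch tying those methods together (the article title
'Germany' is just an example):

import wikipedia

try:
    mysite = wikipedia.getSite()
    page = wikipedia.Page(mysite, 'Germany')
    if page.exists():
        print(page.title())        # 'Germany'
        text = page.get()          # the raw wikitext of the article
        for linked in page.linkedPages():
            print(linked.title())  # titles of pages 'Germany' links to
finally:
    wikipedia.stopme()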

However, instead of page.get() it is advisable to use:

wikipedia.getall(site,pages)

with 'site' being a Site object (e.g. mysite) and pages a list (or
more generally, iterable) of Page objects. It will get all pages in
the list using a single call to the wiki, thus speeding up your bot
and at the same time reducing its load on the wiki. Once a page has
been loaded (either through get or through getall), subsequent calls
to page.get() will not reload it. Thus, the normal way of working is
to create a list of pages one is interested in, use getall (in groups
of 60 or so) to load them, then use get to work with them.
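In code, that pattern might look like the following (the country
titles are made-up examples, and the batch is well under the ~60-page
limit mentioned above):

import wikipedia

try:
    mysite = wikipedia.getSite()
    titles = ['Germany', 'France', 'Japan']
    pages = [wikipedia.Page(mysite, title) for title in titles]

    # One call to the wiki loads the whole batch...
    wikipedia.getall(mysite, pages)

    # ...and page.get() now returns the cached text without refetching.
    for page in pages:
        if page.exists():
            text = page.get()
            print(page.title())
finally:
    wikipedia.stopme()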

Another useful file in the framework is pagegenerators. It provides a
number of generators that yield Page objects. Some interesting ones
(check the code for the exact parameters):

AllpagesPageGenerator: generates all pages of the wiki, alphabetically
from a specified starting point
ReferringPageGenerator: all pages linking to a given page
CategorizedPageGenerator: all pages in a given category
LinkedPageGenerator: all pages linked to from a given page

Other generators are used by 'wrapping them around' a given generator.
The most important of these is the PreloadingGenerator, which ensures
that the pages are preloaded (using wikipedia.getall) in groups.

A simple way to use the bot framework to scrape all pages of the
English Wikipedia (warning: This takes a few days!) would be:

import wikipedia
import pagegenerators

basicgen = pagegenerators.AllpagesPageGenerator(includeredirects=False)
generator = pagegenerators.PreloadingGenerator(basicgen, 200)
for page in generator:
    title = page.title()
    text = page.get()
    <do whatever you want with title and text>
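And since the original goal was the data in the country infoboxes:
once you have the wikitext, a rough way to pull out a single infobox
parameter is a regular expression like the one below (it assumes the
value fits on one line, which is not true for every field):

import re

def infobox_field(wikitext, field):
    """Return the value of an infobox parameter such as 'capital',
    i.e. the text after '| capital =' in the article's wikitext."""
    pattern = r'^\|\s*%s\s*=\s*(.+)$' % re.escape(field)
    match = re.search(pattern, wikitext, re.MULTILINE)
    return match.group(1).strip() if match else None

# e.g. infobox_field(text, 'capital') might return '[[Berlin]]'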
 
