Scraping Wikipedia with Python

Dotan Cohen

I plan on making a geography-learning Anki [1] deck, and Wikipedia has
the information that I need in nicely formatted tables on the side of
each country's page. Has someone already invented a wheel to parse and
store that data (scrape)? It is probably not difficult to code, and
within the Wikipedia license, but if that wheel has already been
invented then I don't want to redo it. I tried googling for a
Wikipedia-specific solution but found none. Is there a general-purpose
solution that I could use?

Note that I am a regular Wikipedia contributor and plan on staying
within the realm of Wikipedia's rules.


[1] http://ichi2.net/anki/
 
John Nagle

Dotan said:
I plan on making a geography-learning Anki [1] deck, and Wikipedia has
the information that I need in nicely formatted tables on the side of
each country's page. Has someone already invented a wheel to parse and
store that data (scrape)?

Wikipedia has an API for computer access. See

http://www.mediawiki.org/wiki/API
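For illustration, a minimal Python 3 sketch that fetches one article's
raw wikitext through that API with only the standard library (the page
title "Germany" and the User-Agent string are arbitrary example values):

import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    """Fetch the raw wikitext of one article via action=parse."""
    params = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
    })
    request = urllib.request.Request(
        API_URL + "?" + params,
        headers={"User-Agent": "geography-deck-example/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        data = json.loads(response.read().decode("utf-8"))
    return data["parse"]["wikitext"]["*"]

# The country infobox sits at the top of the returned wikitext.
print(fetch_wikitext("Germany")[:300])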

John Nagle
 
Dotan Cohen

Try reading a little there! Starting there I went to
http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot

where I found a section on existing bots, comments on how the "scraping"
is not what you want, and even a Python section with a link to something
labelled PyWikipediaBot...

Thanks. I read the first bit of that page, but did not finish it.
Grepping it for Python led me to what I need.

Sorry for the noise.
 
Paul Rubin

Dotan Cohen said:
Thanks. I read the first bit of that page, but did not finish it.
Grepping it for Python led to to what I need.

Maybe you want DBpedia.
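As a minimal sketch, here is how one could ask DBpedia's public SPARQL
endpoint for a country's capital instead of parsing the infobox by hand
(the choice of Germany is an arbitrary example):

import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"

# DBpedia extracts this value from the Wikipedia infobox.
query = """
SELECT ?capital WHERE {
  <http://dbpedia.org/resource/Germany>
      <http://dbpedia.org/ontology/capital> ?capital .
}
"""

request = urllib.request.Request(
    ENDPOINT + "?" + urllib.parse.urlencode({"query": query}),
    headers={"Accept": "application/sparql-results+json"},
)
with urllib.request.urlopen(request) as response:
    results = json.loads(response.read().decode("utf-8"))

for row in results["results"]["bindings"]:
    print(row["capital"]["value"])  # e.g. http://dbpedia.org/resource/Berlin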
 
Dotan Cohen

Paul Rubin said:
Maybe you want DBpedia.

I did not know about this. Thanks!

That is the reason why I ask. This list has an unbelievable collective
knowledge and I am certain that asking "how much is 2+2" would net an
insightful answer that would teach me something.

Thank you, Paul, and thank you to the entire Python list!
 
Andre Engels

Try reading a little there! Starting there I went to

http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot

where I found a section on existing bots, comments on how the "scraping"
is not what you want, and even a Python section with a link to something
labelled PyWikipediaBot...

Some information on using the PyWikipediaBot for scraping from someone
who used to program on the bot (and occasionally still does):

To make the framework work, you need to add a file user-config.py with
the following contents:

family = 'wikipedia'
mylang = 'en'

If you want to use the bot to also edit pages on wikipedia, you will
have to add:

usernames['wikipedia']['en'] = <the username of your bot>

If you work on another language, you of course use that language's
abbreviation instead of 'en'.
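Put together, a minimal user-config.py along those lines could look
like this (the bot name 'MyGeographyBot' is a made-up placeholder; the
last line is only needed if the bot will edit):

# user-config.py -- minimal configuration for the framework
family = 'wikipedia'   # work on Wikipedia (rather than a sister project)
mylang = 'en'          # default to the English-language edition

# Only needed for editing; the username here is a placeholder.
usernames['wikipedia']['en'] = 'MyGeographyBot'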

The heart of the framework is the file wikipedia.py; you need to
import that one. It contains two important classes, Page and Site,
which represent a Wikipedia page and the site as a whole,
respectively.

It is best to put your code in a try like this:

try:
    mysite = wikipedia.getSite()
    <your code here>
finally:
    wikipedia.stopme()

The stopme() functionality is part of the bot's throttling behaviour,
which avoids over-feeding the server with requests. The framework waits
a certain time (default is 10 seconds) between two requests, but if you
have several bots running, it will lengthen this time. stopme() signals
that this bot is not running any more, so other runs are not delayed by
it. wikipedia.getSite() gets the Site object for your default site
(with the settings above, that is the English-language Wikipedia).

Still with me? Good, because now we get into the real programming.

The Page class has as its __init__:
def __init__(self, site, title, insite=None, defaultNamespace=0):

Here, site is the wiki on which the page exists (usually this will be
mysite, which is why I defined it above) and title is the title of the
page. The optional parameters are for special usage.

The Page class has a number of methods, which you can find in the
file. Some of the most important are listed below (a short usage sketch
follows the list):
page.title() - the title of the page
page.site() - the wiki the page is on
page.get() - the (wiki) text of the page
page.put(text) - saves the page with 'text' as its new content. An
important optional parameter is 'comment', which specifies the summary
that is given with the change
page.exists() - a boolean, true if the page exists, false otherwise
page.linkedPages() - a list of Page objects, being the pages the page links to
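As a short sketch tying those methods together (the article title
'Germany' is just an example):

import wikipedia

try:
    mysite = wikipedia.getSite()
    page = wikipedia.Page(mysite, 'Germany')
    if page.exists():
        print(page.title())        # 'Germany'
        text = page.get()          # the raw wikitext of the article
        for linked in page.linkedPages():
            print(linked.title())  # titles of pages 'Germany' links to
finally:
    wikipedia.stopme()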

However, instead of page.get() it is advisable to use:

wikipedia.getall(site,pages)

with 'site' being a Site object (e.g. mysite) and pages a list (or
more generally, iterable) of Page objects. It will get all pages in
the list using a single call to the wiki, thus speeding up your bot
and at the same time reducing its load on the wiki. Once a page has
been loaded (either through get or through getall), subsequent calls
to page.get() will not reload it. Thus, the normal way of working is
to create a list of pages one is interested in, use getall (in groups
of 60 or so) to load them, then use get to work with them.
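In code, that pattern might look like the following (the country
titles are made-up examples, and the batch is well under the ~60-page
limit mentioned above):

import wikipedia

try:
    mysite = wikipedia.getSite()
    titles = ['Germany', 'France', 'Japan']
    pages = [wikipedia.Page(mysite, title) for title in titles]

    # One call to the wiki loads the whole batch...
    wikipedia.getall(mysite, pages)

    # ...and page.get() now returns the cached text without refetching.
    for page in pages:
        if page.exists():
            text = page.get()
            print(page.title())
finally:
    wikipedia.stopme()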

Another useful file in the framework is pagegenerators. It provides a
number of generators that yield Page objects. Some interesting ones
(check the code for the exact parameters):

AllpagesPageGenerator: generates all pages of the wiki, alphabetically
from a specified starting point
ReferringPageGenerator: all pages linking to a given page
CategorizedPageGenerator: all pages in a given category
LinkedPageGenerator: all pages linked to from a given page

Other generators are used by 'wrapping them around' a given generator.
The most important of these is the PreloadingGenerator, which ensures
that the pages are preloaded (using wikipedia.getall) in groups.

A simple way to use the bot framework to scrape all pages of the
English Wikipedia (warning: This takes a few days!) would be:

import wikipedia
import pagegenerators

basicgen = pagegenerators.AllpagesPageGenerator(includeredirects=False)
generator = pagegenerators.PreloadingGenerator(basicgen, 200)
for page in generator:
    title = page.title()
    text = page.get()
    <do whatever you want with title and text>
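And since the original goal was the data in the country infoboxes:
once you have the wikitext, a rough way to pull out a single infobox
parameter is a regular expression like the one below (it assumes the
value fits on one line, which is not true for every field):

import re

def infobox_field(wikitext, field):
    """Return the value of an infobox parameter such as 'capital',
    i.e. the text after '| capital =' in the article's wikitext."""
    pattern = r'^\|\s*%s\s*=\s*(.+)$' % re.escape(field)
    match = re.search(pattern, wikitext, re.MULTILINE)
    return match.group(1).strip() if match else None

# e.g. infobox_field(text, 'capital') might return '[[Berlin]]'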
 
