Wikipedia XML Dump

kevingloveruk

Hi

I have downloaded and unzipped the XML dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".

I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:

Is what I am trying to do actually feasible?

Are there any example programs or code snippets that would help me?

Any advice or guidance would be gratefully received.

Best regards,
Kevin Glover
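
As a quick feasibility check on the counting step itself, a minimal sketch (the sample text and phrases are just placeholders): word-boundary regexes keep "dog's" from matching inside longer words, and re.escape handles the apostrophe.

import re

def count_phrase(text, phrase):
    # \b word boundaries keep "dog's" from matching inside longer words;
    # re.escape protects the apostrophe and any other special characters.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "The tail of the dog wagged. The dog's tail, that is."
for phrase in ("of the dog", "dog's"):
    print(phrase, "->", count_phrase(sample, phrase))

The hard part is not the counting but getting the article text out of 40+GB of XML, which is what the replies below address.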
 
Rustom Mody

> I have downloaded and unzipped the XML dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".
> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:
> Is what I am trying to do actually feasible?

Can't really visualize what you've got...
When you 'download' Wikipedia, what do you get?
One 40GB file?
A zillion files?
Some other database format?

Another point:
SAX is painful to use compared to full lxml (DOM),
but then SAX is the only choice once files cross a certain size.
That's why the above question.

Also, you may want to explore nltk.
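
To make the SAX route concrete, here is a minimal sketch, assuming the usual pages-articles dump layout (one <page> per article, with the wikitext in a <text> element under <revision>); the filename and phrase are placeholders:

import re
import xml.sax

class PhraseCounter(xml.sax.ContentHandler):
    def __init__(self, phrase):
        super().__init__()
        self.pattern = re.compile(r"\b" + re.escape(phrase) + r"\b")
        self.count = 0
        self.in_text = False
        self.buffer = []

    def startElement(self, name, attrs):
        if name == "text":          # wikitext body of one revision
            self.in_text = True
            self.buffer = []

    def characters(self, content):
        if self.in_text:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "text":
            self.in_text = False
            page_text = "".join(self.buffer)
            self.count += len(self.pattern.findall(page_text))

handler = PhraseCounter("of the dog")
# xml.sax streams the file, so the 40+GB dump never has to fit in memory.
xml.sax.parse("enwiki-pages-articles.xml", handler)
print(handler.count)

The handler buffers only one article's text at a time, which is why this works on a file far larger than RAM.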
 
Skip Montanaro

> Another point:
> SAX is painful to use compared to full lxml (DOM),
> but then SAX is the only choice once files cross a certain size.
> That's why the above question.

No matter what the choice of XML parser, I suspect you'll want to convert it to some other form for processing.

Skip
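
One hedged sketch of that conversion: stream the dump once with the stdlib's iterparse, append every article's wikitext to a flat text file, and run all later phrase searches against that file instead of the XML (the filenames are placeholders):

import xml.etree.ElementTree as ET

def localname(tag):
    # The dump puts everything in a default namespace; strip it.
    return tag.rsplit("}", 1)[-1]

# One-off pass: copy each article's wikitext into a flat text file.
with open("wikipedia.txt", "w", encoding="utf-8") as out:
    context = ET.iterparse("enwiki-pages-articles.xml", events=("start", "end"))
    _, root = next(context)            # keep a handle on the root element
    for event, elem in context:
        if event == "end":
            if localname(elem.tag) == "text" and elem.text:
                out.write(elem.text)
                out.write("\n")
            elif localname(elem.tag) == "page":
                root.clear()           # free the finished <page> subtree

After the one-off pass, each search is a sequential scan of wikipedia.txt with the kind of regex count sketched earlier, rather than a fresh 40+GB XML parse.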
 
Kevin Glover

Thanks for the comments, guys. The Wikipedia download is a single XML document, 43.1GB. Any further thoughts?

Kevin
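
(A hedged aside on the unzipping: xml.sax.parse accepts any file-like object, so a future run could stream the compressed download directly; the filename below is a placeholder and the no-op handler stands in for a real one.)

import bz2
import xml.sax

# The dump can be parsed straight from the compressed download, without
# first unzipping 40+GB to disk.
with bz2.open("enwiki-pages-articles.xml.bz2", "rb") as f:
    xml.sax.parse(f, xml.sax.ContentHandler())  # swap in a real handler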
 
alex23

> I have downloaded and unzipped the XML dump of Wikipedia (40+GB). I want to use Python and the SAX module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a count of the number of hits for each search. Typical phrase-searches might be "of the dog" and "dog's".
>
> I have some limited prior programming experience (from many years ago) and I am currently learning Python from a course of YouTube tutorials. Before I get much further, I wanted to ask:
>
> Is what I am trying to do actually feasible?

Rather than parsing through 40GB+ every time you need to do a search, you should get better performance using an XML database, which will allow you to run queries directly on the XML data.

http://basex.org/ is one such db, and comes with a Python API:

http://docs.basex.org/wiki/Clients
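
For a sense of what that looks like, a minimal sketch, assuming the BaseXClient module from the docs page above, a local BaseX server with its default credentials, and a database named wikipedia already created from the dump; note that this particular query counts pages containing the phrase, not total occurrences:

import BaseXClient  # client module from http://docs.basex.org/wiki/Clients

# Connect to a locally running BaseX server (default port/credentials).
session = BaseXClient.Session("localhost", 1984, "admin", "admin")
try:
    session.execute("open wikipedia")  # a database built from the dump
    # Counts pages whose <text> contains the phrase, not the total
    # number of occurrences.
    query = "count(//*[local-name() = 'text'][contains(., 'of the dog')])"
    print(session.execute("xquery " + query))
finally:
    session.close()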
 
